Data from the people, about the people! Person-centric data covers a broad range of potential forms and sources. It might be survey data with attitudinal variables, purchase history, number of visits to the doctor’s office, number of alcoholic drinks had during the last seven days, internet browsing behavior, or even Tweets. If we start to consider all of the data that we see, it becomes clear that a great deal of data is person-centric.
Not all data fits into this notion of person-centricity. We often find financial data (think end-of-day trading) that really does not fit our notion of person-centric data. Even though this stock-flavored financial data does not really fit, the amount of money a person spends certainly does. Furthermore, much public data exists for municipal entities, such as county assessors. It might seem like a property record is not person-centric, but we could build data that tracks a person’s property buying and selling – we would certainly look at such data and consider it to be behavioral in nature.
What do we do with person-centric, behavioral data? In general, we want to predict or explain behavior in some way! To do that, we need to create some hypotheses. We must, however, be very careful in crafting our hypotheses. While we can do some fancy analyses that start to lead us down a more causal path, we can never fully generalize what makes people do the things that they do.
This will be a theme that we revisit during our time together. Just so that you can start to get your head around hypotheses, let’s take a first try at a little game.
I am noticing that visitors to my site spend less time on one page compared to other pages. Here is what that data might look like:
visitorID pageID secondsOnPage
1 1 a 16.152030
2 1 b 15.357179
3 1 c 14.385829
4 2 a 21.789992
5 2 b 13.612962
6 3 a 11.147774
7 3 b 14.483975
8 3 c 8.654847
9 3 d 16.865599
I have a visitor identification number, a page identifier, and the number of seconds each user spent on a given page. Do I have data that would allow me to test if there is a difference in the amount of time that people spend on a page? I certainly do! With the data that you see, could I determine exactly why there might be differences in time on page? I would say rather unlikely with the provided data. We always need to make sure that we are not speaking beyond our data!
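With so few observations, nothing here is going to reach significance, but the machinery for testing a difference in time on page looks like this (a sketch that rebuilds the table above by hand):

```r
# Hypothetical reconstruction of the time-on-page data shown above
pageTime <- data.frame(
  visitorID = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  pageID = c("a", "b", "c", "a", "b", "a", "b", "c", "d"),
  secondsOnPage = c(16.152030, 15.357179, 14.385829, 21.789992,
                    13.612962, 11.147774, 14.483975, 8.654847, 16.865599)
)

# A one-way ANOVA asks whether mean time on page differs across pages --
# it tests *if* there is a difference, not *why* the difference exists
pageModel <- aov(secondsOnPage ~ pageID, data = pageTime)
summary(pageModel)
```

Notice that nothing in this model (or this data) could ever tell us why page times differ – exactly the limit discussed above.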
This was just a demo version of the game – you will unlock more difficult levels as we progress.
Think back to your linear models course. Do you remember examining adjusted \(R^2\)? It is one of the more popular measures of how well a particular model explains the data (adjusted \(R^2\) tells us how much variance in the dependent variable is explained by the predictor variables in the equation, penalized for the number of predictors). We are traditionally pushed towards looking for very high values. Students coming from a variety of statistical traditions have been taught a great many “thresholds” for acceptable \(R^2\) values – anywhere from .4 to .7 can be mentioned all in the span of a few seconds. If, however, we are trying to explain what someone does (i.e., a particular behavior), what are the chances that we will obtain such adjusted \(R^2\) values? Probably not very high, right? People are messy and often unpredictable, so achieving huge \(R^2\) values is unlikely.
Does a small \(R^2\) mean that your model is garbage‽ Of course not! A small \(R^2\) does absolutely nothing to take away from the relationships between your predictor variables and your outcome. A low \(R^2\) might mean that you have more error in your prediction, but there are still better ways of figuring that out than just looking at one value and giving up.
This is not to suggest that we should accept an \(R^2\) of .000001 as the most predictive model in the world (note – it probably is not); however, we should not be terribly dejected to see a model with 4 or 5 predictors with an \(R^2\) around .2. That means that we have taken one of nature’s sloppiest and most unpredictable creatures, and explained 20% of the variance of a behavior.
With that aside on \(R^2\) out of the way, we can think more generally about analyzing behavioral data (that is why we are here). The Behavioral Sciences have a rich tradition of interesting analyses: from experimental design to causal models and all points in between. When dealing with behavioral data, we are frequently looking at some type of latent variables; these are constructs we think exist, but have no way to physically touch them. We are going to be dealing in the latent world for the majority of our time together, so start thinking about latent variables now!
Person-centric data comes in many different flavors.
There is the classic cross-sectional, wide format (tends towards a vanilla taste):
id satisfaction leftTip
1 1 1 1
2 2 4 1
3 3 3 1
4 4 4 0
5 5 2 1
We also have repeated measures:
id observation satisfaction leftTip rating
1 1 1 1 1 3
2 1 2 5 0 7
3 1 3 5 0 6
4 2 1 1 0 2
5 2 2 2 1 3
6 2 3 4 1 5
7 3 1 5 1 7
8 3 2 1 1 3
9 3 3 1 1 3
And, key-value pairs:
variable value
1 id 1
2 id 1
3 id 1
4 id 2
5 id 2
6 id 2
7 id 3
8 id 3
9 id 3
10 observation 1
11 observation 2
12 observation 3
13 observation 1
14 observation 2
15 observation 3
16 observation 1
17 observation 2
18 observation 3
19 satisfaction 1
20 satisfaction 5
21 satisfaction 5
22 satisfaction 1
23 satisfaction 2
24 satisfaction 4
25 satisfaction 5
26 satisfaction 1
27 satisfaction 1
28 leftTip 1
29 leftTip 0
30 leftTip 0
31 leftTip 0
32 leftTip 1
33 leftTip 1
34 leftTip 1
35 leftTip 1
36 leftTip 1
37 rating 3
38 rating 7
39 rating 6
40 rating 2
41 rating 3
42 rating 5
43 rating 7
44 rating 3
45 rating 3
And then from there, we can start to do combinations:
id variable value
1 1 observation 1
2 1 observation 2
3 1 observation 3
4 2 observation 1
5 2 observation 2
6 2 observation 3
7 3 observation 1
8 3 observation 2
9 3 observation 3
10 1 satisfaction 1
11 1 satisfaction 5
12 1 satisfaction 5
13 2 satisfaction 1
14 2 satisfaction 2
15 2 satisfaction 4
16 3 satisfaction 5
17 3 satisfaction 1
18 3 satisfaction 1
19 1 leftTip 1
20 1 leftTip 0
21 1 leftTip 0
22 2 leftTip 0
23 2 leftTip 1
24 2 leftTip 1
25 3 leftTip 1
26 3 leftTip 1
27 3 leftTip 1
28 1 rating 3
29 1 rating 7
30 1 rating 6
31 2 rating 2
32 2 rating 3
33 2 rating 5
34 3 rating 7
35 3 rating 3
36 3 rating 3
All of these have their place. Many analyses require traditional wide data. Others, like panel models, require long data (e.g., our repeated measures data).
library(plm)
panelModel = plm(rating ~ satisfaction, index = c("id", "observation"),
data = repeatedMeasures)
summary(panelModel)

Oneway (individual) effect Within Model
Call:
plm(formula = rating ~ satisfaction, data = repeatedMeasures,
index = c("id", "observation"))
Balanced Panel: n = 3, T = 3, N = 9
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.598291 -0.068376 -0.017094 0.136752 0.401709
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
satisfaction 0.94872 0.06784 13.985 3.362e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 24
Residual Sum of Squares: 0.59829
R-Squared: 0.97507
Adj. R-Squared: 0.96011
F-statistic: 195.571 on 1 and 5 DF, p-value: 3.3615e-05
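As a sanity check on what plm is doing, the same “within” (fixed effects) slope can be recovered in base R by adding person dummies; the data frame below is my reconstruction of the repeated measures table shown earlier:

```r
# Hypothetical reconstruction of the repeated-measures data shown above
repeatedMeasures <- data.frame(
  id = rep(1:3, each = 3),
  observation = rep(1:3, times = 3),
  satisfaction = c(1, 5, 5, 1, 2, 4, 5, 1, 1),
  leftTip = c(1, 0, 0, 0, 1, 1, 1, 1, 1),
  rating = c(3, 7, 6, 2, 3, 5, 7, 3, 3)
)

# The within estimator is equivalent to a regression with a dummy for each
# person -- the satisfaction slope matches plm's 0.94872
withinModel <- lm(rating ~ satisfaction + factor(id), data = repeatedMeasures)
coef(withinModel)["satisfaction"]
```

This equivalence is a handy way to convince yourself that the panel model is simply controlling for each individual’s own average.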
Sometimes, we need to reshape our data for the sake of visualization (such visualizations are especially handy for survey questions).
id satisfaction1 satisfaction2 satisfaction3
1 1 7 3 4
2 2 7 3 7
3 3 5 5 6
If we want to show several people’s survey responses, we need to do some reshaping:
See if you can reproduce the data and visualization! Just as a hint, you will need to make your wide data into long data (melt from reshape2 or gather from tidyr should do the trick).
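If you get stuck, here is one possible approach using base R’s reshape() (melt from reshape2 or gather from tidyr accomplish the same thing); the wide data frame is my reconstruction of the table above:

```r
# Hypothetical reconstruction of the wide survey data shown above
wideDat <- data.frame(
  id = 1:3,
  satisfaction1 = c(7, 7, 5),
  satisfaction2 = c(3, 3, 5),
  satisfaction3 = c(4, 7, 6)
)

# Stack the three satisfaction columns into one long column,
# with an "item" variable recording which question each value came from
longDat <- reshape(wideDat,
                   direction = "long",
                   varying = list(c("satisfaction1", "satisfaction2",
                                    "satisfaction3")),
                   v.names = "satisfaction",
                   timevar = "item",
                   idvar = "id")
longDat[order(longDat$id), ]
```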
No matter the need, don’t be afraid to shape your data in a way that conforms to your needs. Always remember that you are in control of your data – your data is not in control of you! With various packages from base, the tidyverse, and beyond, you have the power to make your data be whatever you need it to be. While you cannot control the sloppy data that people produce, you can control the data format.
When dealing with person-centric data, we need to always be thinking about ethics. Are we treating the data in a secure manner and honoring people’s privacy? Are we being honest to people about how we are going to handle their data? These are questions that require your care and attention.
This becomes even more important when sharing data. Look at the following data and think about whether or not it should be shared as is:
data.frame(age = sample(18:35, 8, replace = TRUE),
gender = sample(c("male", "female"), 8, replace = TRUE),
department = sample(c("cs", "hr", "it", "accounting"), 8, replace = TRUE),
score = sample(50:100, 8))

age gender department score
1 25 female it 81
2 21 male accounting 92
3 24 male cs 77
4 32 male cs 87
5 19 female accounting 51
6 23 female it 63
7 33 male it 78
8 18 male accounting 61
There are not any personal identifiers here, so we are good to put it on Github or Google Drive and share away…right? Let’s pick this apart a little bit. Any of the demographic-inclined variables, in isolation, are not a cause for any concern; when we have all of them together, though, we might be able to identify individuals through “triangulation”. This becomes especially true when there are groups with fewer individuals. Triangulation is a real concern with any type of data, but we need to be diligent in good security and privacy practices when we are dealing with person-centric data. This gets into an area known as de-identification – the removal of not only concrete identifiers (e.g., names, SSNs), but also anything that can be pieced together to identify a person. So, what items would we need to remove from our data to make it de-identified?
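One possible de-identification pass, assuming the analysis only needs the score and a coarse age band (the specific bands and dropped columns here are my own illustrative choices, not a universal recipe):

```r
# Hypothetical reconstruction of the data above
rawDat <- data.frame(
  age = c(25, 21, 24, 32, 19, 23, 33, 18),
  gender = c("female", "male", "male", "male",
             "female", "female", "male", "male"),
  department = c("it", "accounting", "cs", "cs",
                 "accounting", "it", "it", "accounting"),
  score = c(81, 92, 77, 87, 51, 63, 78, 61)
)

# Coarsen age into bands and drop gender/department entirely, so that
# no combination of fields can triangulate back to an individual
deidentified <- data.frame(
  ageBand = cut(rawDat$age, breaks = c(17, 25, 35),
                labels = c("18-25", "26-35")),
  score = rawDat$score
)
deidentified
```

The general principle: keep only what the analysis needs, and coarsen anything that could be combined with other fields to single someone out.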
It might seem far-fetched that anyone would ever go through the trouble of identifying your data, but you should think conspiratorially when it comes to your data – imagine the absolute worst thing that could happen if someone took it and identified it. Could it cause personal problems, job loss, or reveal something embarrassing or illegal?
Remember, the 4Chan collective found a flag in rural Tennessee using contrail patterns, star alignment, and a car horn – in 37 hours. Don’t forget that they also contributed to the identification and subsequent airstrike of an ISIS training facility. Just imagine what could be done with your interesting data.
Complete anonymity really does not exist – ever. If you are in charge of collecting any type of person-centric data, you can never promise complete anonymity; doing so is truly an amateur move and demonstrates a misunderstanding of behavioral data. To guarantee anonymity is to guarantee that there is no possible way that a person could be identified – this is almost never the case. You may, on the other hand, promise confidentiality. That means that you will take every possible precaution to ensure that a person’s data is not broadcast or imprudently shared.
We see many different types of data when working with behavioral data, well beyond what we typically think about with our different levels of measurement (nominal, ordinal, interval, and ratio).
Survey data is where we typically find attitudinal data and it tends to present some interesting issues. How are the items measured – are they traditional Likert response options (strongly disagree to strongly agree) or are they some type of feelings thermometers (typically some crazy range that could be anywhere between 0 to 100 or -100 to 100). Perhaps it is a Likert-type (Not at all to Very much). How many response options are there – 3, 5, 7, or more?
When dealing with such attitudinal data, we might see data that takes any of the following forms:
library(magrittr)
data.frame(agreementNumeric = 1:5,
agreementOrdered = ordered(1:5, labels = c("Strongly disagree",
"Disagree", "Neither disagree nor agree",
"Agree", "Strongly agree")),
agreementCharacter = c("Strongly disagree",
"Disagree", "Neither disagree nor agree",
"Agree", "Strongly agree"),
pseudoLikert = c("Not at all", "A little", "Some", "A lot", "Very much")) %T>%
glimpse() %>%
summary()

Observations: 5
Variables: 4
$ agreementNumeric <int> 1, 2, 3, 4, 5
$ agreementOrdered <ord> Strongly disagree, Disagree, Neither disagr...
$ agreementCharacter <fct> Strongly disagree, Disagree, Neither disagr...
$ pseudoLikert <fct> Not at all, A little, Some, A lot, Very much
agreementNumeric agreementOrdered
Min. :1 Strongly disagree :1
1st Qu.:2 Disagree :1
Median :3 Neither disagree nor agree:1
Mean :3 Agree :1
3rd Qu.:4 Strongly agree :1
Max. :5
agreementCharacter pseudoLikert
Agree :1 A little :1
Disagree :1 A lot :1
Neither disagree nor agree:1 Not at all:1
Strongly agree :1 Some :1
Strongly disagree :1 Very much :1
This is where data familiarization and exploration becomes important – always take note of variable types. If you are receiving this data from another source, you never know what you might be getting.
With the data as it is in its current form, think about what you might like to do with it. Would you like to plot it or create a correlation matrix? Can you make that happen? Data processing becomes very important when working through any survey-based data.
What functions would you use to convert everything to numeric?
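One possible answer (a sketch reusing the scale labels from the data above): for ordered factors, as.numeric() returns the underlying level index, while raw character responses can be matched against the scale labels first.

```r
# The Likert scale labels from the example above
likertLevels <- c("Strongly disagree", "Disagree",
                  "Neither disagree nor agree", "Agree", "Strongly agree")

agreementOrdered <- ordered(1:5, labels = likertLevels)
agreementCharacter <- c("Strongly disagree", "Disagree",
                        "Neither disagree nor agree", "Agree",
                        "Strongly agree")

# Ordered factors carry their level order, so the integer codes are meaningful
as.numeric(agreementOrdered)

# Characters have no inherent order -- match them to the labels explicitly
match(agreementCharacter, likertLevels)
```

Be careful with numbers stored as factors: as.numeric() on such a factor returns the level codes, not the numbers, so convert via as.character() first.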
We might run into data that is censored. While it might sound like something exciting at first, censored variables are simply those for which we do not know the values before data collection started or after it ended. You may run into variables that have been discretized – a continuous variable, such as age, broken down into discrete categories. When we get that age variable, we might see categories like “<25”, “25-35”, “36-45”, “46-55”, and “55+”. When people have an age of “<25”, we have left censored data – we don’t know how far to the “left” the value actually is, we just know that it is somewhere under 25. For our observations with “55+”, we have right censored data – we don’t know how far to the right the value actually is.
ageCategory censored
1 <25 Left censored
2 25-35 Uncensored
3 36-45 Uncensored
4 46-55 Uncensored
5 55+ Right censored
While we won’t be diving deeply into them, tobit models are one way of dealing with censored dependent variables.
The following is similar to an example from the AER package. The max number of reported affairs is 12, but that was at the time of reporting – the philandering partners could have engaged in more affairs after data was collected. To that end, we need to account for right censoring.
library(AER)
data("Affairs")
fm.tobit = tobit(affairs ~ age + yearsmarried + religiousness,
right = 12, data = Affairs)
summary(fm.tobit)
Call:
tobit(formula = affairs ~ age + yearsmarried + religiousness,
right = 12, data = Affairs)
Observations:
Total Left-censored Uncensored Right-censored
601 451 112 38
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.30266 3.04944 -0.099 0.9209
age -0.23709 0.11150 -2.126 0.0335 *
yearsmarried 0.92957 0.19594 4.744 2.09e-06 ***
religiousness -2.59303 0.58544 -4.429 9.46e-06 ***
Log(scale) 2.45733 0.08253 29.777 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Scale: 11.67
Gaussian distribution
Number of Newton-Raphson Iterations: 4
Log-likelihood: -661.8 on 5 Df
Wald-statistic: 36.75 on 3 Df, p-value: 5.1855e-08
If you find yourself on the planning end of some broader data collection effort, absolutely resist the temptation (both internal and external) to discretize data. Some people will say, “But response rates…blah…blah…blah”, or, “We have to make it easy.”, or, “Think of the children.” Consider the following data:
age
1 <25
2 25-35
3 36-45
4 36-45
5 46-55
6 55+
You have 6 respondents – what is the mean age of those respondents? I have no idea and neither do you! If you allow data to be collected this way, you are immediately imposing limits on what you might do with that data in the future. Someone will likely say, “Just take the middle point of the range and call that the response.”
age
1 25
2 30
3 40
4 40
5 50
6 55
mean(age)
1 40
Great! We have an average now. What if the actual ages were as follows:
age
1 21
2 30
3 36
4 45
5 52
6 74
mean(age)
1 43
What is your mean age now? Is it still 40? Will it matter for your analyses? You need to decide if the difference in the answer is important or not. Maybe you decide it is not that different and are comfortable doing this. Let me pose an additional question: is a 25 year old the same as a 35 year old? Just think about that for your own self: is 18 year old you the same as 24 year old you? This becomes an even bigger issue when we have ranges with unequal intervals (commonly seen in income brackets that might start with a few thousand dollars in range, but end with upwards of 100K dollars or more in range).
And here is another common issue: someone will want to discretize a variable and use it as an outcome variable in a model. Such shenanigans are common with discretized income variables. You have the power to make that happen (think back to our friend ordered logistic regression from generalized linear models). The person who was expecting a standard linear regression, though, is going to be really confused when you show them more than one intercept!
Remember, it is okay to make things easy on people, but give them some credit. Most people know how old they are – if they won’t tell you their exact age, they probably won’t be inclined to tell you the range either.
Also known as multilevel or nested data, hierarchical data is best defined by a classic example – we could have students, nested within classrooms, nested within schools, nested within districts, and so on.
Here is what some nested data might look like:
id subgroup group
1 1 1 1
2 2 1 1
3 3 1 1
4 4 2 1
5 5 2 1
6 6 2 2
7 7 3 2
8 8 3 2
9 9 3 2
10 10 4 2
11 11 4 3
12 12 4 3
13 13 5 3
14 14 5 3
15 15 5 3
We have an id variable, which is giving us an individual identifier. Each person is in 1 of 5 subgroups and each subgroup is in 1 of 3 groups.
Depending on your needs, this structure can be utilized in your analyses (we are going to be getting into it within the next few weeks).
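The structure in the table above can be built compactly with rep() – a small sketch reproducing it exactly:

```r
# Reconstruction of the nested data shown above:
# 15 people, in 5 subgroups of 3, across 3 groups of 5
nestedDat <- data.frame(
  id = 1:15,
  subgroup = rep(1:5, each = 3),  # each subgroup repeats 3 times
  group = rep(1:3, each = 5)      # each group covers 5 consecutive rows
)
head(nestedDat)
```

Note that in this particular example the subgroups do not nest cleanly inside the groups (subgroup 2 spans groups 1 and 2) – something to check for in real nested data before modeling.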
Repeated measures can be thought of in the same way as longitudinal data – we are measuring the same thing for a person for an extended period of time. If you want, you can even think of it as nested data; the time-based observations are nested within an individual.
Here is how that data will typically look:
data.frame(id = rep(1:3, each = 3),
observation = rep(1:3, times = 3),
responseTimeSeconds = sample(1:5, 9, replace = TRUE))

id observation responseTimeSeconds
1 1 1 1
2 1 2 2
3 1 3 5
4 2 1 5
5 2 2 1
6 2 3 4
7 3 1 3
8 3 2 4
9 3 3 1
This data shows us 3 different people, each with 3 different observations.
Welcome to the now! With people generating volume after volume of text, we have many questions that we can explore.
We can also have a bit of exploratory fun!
You can likely imagine the source, which gets back to our conversation last week about ethics and confidentiality. A great deal of data is public and out there for the taking, but not all of it is.
When we are dealing in behavioral data, there are many places that we can typically find it. Many are obvious, like surveys. Others, however, might require a little bit more imagination.
Self-reported surveys might be the classic source for behavioral data. With the rise of web-based surveys, collecting data has never been easier. Although response rates have declined over the years, it is still easy to amass a reasonably sized sample (especially if you can pay participants). In addition to collecting your own data, there are numerous big surveys that make their data available to the public (e.g., the Behavioral Risk Factor Surveillance System and the National Survey on Drug Use and Health).
Surveys provide data that people explicitly offer, but what about the behaviors that go into survey responding (or any behaviors that lead up to an action)? Paradata and metadata offer such data. While both relate to the “hidden” data that people produce, they are a bit different. Metadata is all of the little stuff that happens when people use a site – login times, time on page, navigation, and other data related to use. Client-side paradata (CSP), once thought of mainly in the context of person-to-person surveys, details the small things that people do. Mouse movements, clicks, scrolling, and related variables can all be collected and analyzed.
Records tend to be a bit tricky, but they are out there (in fact, a great many organizations make money hand over fist by pulling public records together). While you might be granted the keys to the kingdom and given access to such databases, you can pull public records from many different places: county assessor websites and county courthouse websites are two such examples (jailtracker is fun to use). Many cities across America are also diving into the open data game and providing a great wealth of data. While it might not all be useful, there is certainly enough there to begin thinking about interesting questions.
Although we have identified our hypotheses and gotten our data into shape, that does not mean that we should just go right into hypothesis testing.
We can use the power of R to explore our data in many different ways. Perhaps you would like to dive right in and examine the relationships within your data. You might be tempted to construct a correlation matrix:
corDat = data.frame(item1 = sample(1:7, 50, replace = TRUE),
item2 = sample(1:7, 50, replace = TRUE),
item3 = sample(1:7, 50, replace = TRUE),
item4 = sample(1:7, 50, replace = TRUE),
item5 = sample(1:7, 50, replace = TRUE))
cor(corDat)

item1 item2 item3 item4 item5
item1 1.00000000 -0.133509603 -0.15432422 -0.097453039 -0.48378554
item2 -0.13350960 1.000000000 0.13222542 -0.006078968 -0.12516972
item3 -0.15432422 0.132225415 1.00000000 -0.325306407 0.07113753
item4 -0.09745304 -0.006078968 -0.32530641 1.000000000 0.11022161
item5 -0.48378554 -0.125169722 0.07113753 0.110221608 1.00000000
Quickly, where are the strongest correlations? It probably took a few seconds to process the whole correlation matrix; now imagine doing that with many variables.
This is where corrplot (or corrgrams to some) come in handy.
corrplot::corrplot(cor(corDat))

This gives us a much quicker way to find the strong relationships amongst our variables. It can also give us a good idea about how items might shake out into components:
set.seed(5)
df = data.frame(item1 = sample(1:7, 10, replace = TRUE),
item2 = sample(1:7, 10, replace = TRUE),
item3 = sample(1:7, 10, replace = TRUE),
item4 = sample(1:7, 10, replace = TRUE),
item5 = sample(1:7, 10, replace = TRUE),
item6 = sample(1:7, 10, replace = TRUE),
item7 = sample(1:7, 10, replace = TRUE),
item8 = sample(1:7, 10, replace = TRUE),
item9 = sample(1:7, 10, replace = TRUE),
item10 = sample(1:7, 10, replace = TRUE),
item11 = sample(1:7, 10, replace = TRUE),
item12 = sample(1:7, 10, replace = TRUE),
item13 = sample(1:7, 10, replace = TRUE),
item14 = sample(1:7, 10, replace = TRUE),
item15 = sample(1:7, 10, replace = TRUE))
cor(df) %>%
corrplot::corrplot(., order = "FPC")

Knock Knock
Who’s There?
Orange
Orange Who?
Orange you glad that you did not need to look through a 15 by 15 correlation matrix to find the strongest relationships?
While it might take you a bit to appreciate statistical humor, you will immediately see benefits to such plots.
You can even make them more art than utility:
df %>%
cor() %>%
as.matrix() %>%
reshape2::melt(.) %>%
ggplot(., aes(Var1, Var2, fill = value)) +
geom_raster(interpolate = TRUE) +
scale_fill_gradient(low = "#00ffd5", high = "#ff5500") +
theme_minimal()

Not only can you quickly explore your data for hotspots, you can also create desktop backgrounds.
Heatmaps can be used in conjunction with distance measures to tell how similar observations might be. Instead of looking at a massive distance matrix, we can compute the distance matrix, and pass it along to our visualization to see how similar (or not) these 50 people are to each other (this gets into clustering, which we will be seeing later in the course).
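A sketch of that idea in base R (the simulated 5-item responses for 50 people are my own invention for illustration):

```r
# Simulate 50 people answering 5 seven-point items
set.seed(1001)
people <- matrix(sample(1:7, 50 * 5, replace = TRUE), nrow = 50)

# Pairwise Euclidean distances between people: small values = similar people
distances <- dist(people)

# Visualize the 50 x 50 distance matrix instead of reading it row by row
heatmap(as.matrix(distances), symm = TRUE)
```

The clustering dendrograms that heatmap() draws along the margins are a preview of where distance matrices will take us later in the course.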
Look at this visualization:
You see the relationship, don’t you?
Hopefully, you aren’t seeing much of anything – these are just two random normal distributions plotted. If you did start to see some pattern, don’t worry – it just means that you are indeed a human being. We have a natural flaw that makes us see patterns when they are not really there; this is why people often see Jesus in chips and on their toast.
Let’s try this out.
Do any of those visualizations pop out to you? In other words, is there one that is absolutely not like the others? Not really. However, one of those visualizations contains the actual data – the other three have random data for one of the variables. Hopefully, you didn’t trick yourself into seeing anything terribly strong.
Let’s try it with something that might have slightly stronger relationships:
Now, can you spot the real relationship in this one? Probably!
Let’s bump the distractors now:
How’s that? Is it easy to spot the one that is markedly different?
There is a point to all of this – don’t get fooled by false patterns. Just because we are studying human behavior and the data it produces, does not make us immune from the trappings of being human.
“A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. With consistency a great soul has simply nothing to do. He may as well concern himself with his shadow on the wall. Speak what you think now in hard words, and to-morrow speak what to-morrow thinks in hard words again, though it contradict every thing you said to-day. — ‘Ah, so you shall be sure to be misunderstood.’ — Is it so bad, then, to be misunderstood? Pythagoras was misunderstood, and Socrates, and Jesus, and Luther, and Copernicus, and Galileo, and Newton, and every pure and wise spirit that ever took flesh. To be great is to be misunderstood.”
Emerson’s (Ralph Waldo, not Keith) beautiful words are great if we are trying to expand our minds and engage in learning in general. One place that we cannot use this, however, is in connecting our data and hypotheses. If we have hypotheses about data, we had better have the data to test those hypotheses. While this may seem intuitive, behavioral data can often present unforeseen challenges. For example, you might be given some data full of attitudinal measures and someone has some ideas about relationships they want to test. If said person is not fully in the know about how to construct and test hypotheses, he/she might come to you wanting to test hypotheses that do not make sense given the data (maybe they want to “say” something about a scale that is not really measured by the scale).
If I asked you your height, could you tell me? How do you know your height and what allowed you to know your height? Like most, you have probably been measured at the doctor’s office with some device marked with inches.
Now, if I were to ask you how you felt, what would you tell me? You might say that you are feeling well or not feeling too well, but how did you measure this? Did you use your wellness measuring tape or your wellness balance beam – unlikely. Instead, you likely thought about everything going on – your current health, general mood, and whatever else you deemed important at the time. Put another way, I could essentially be asking you about your affect (mood). Unfortunately, we can’t measure your affect like we can measure your height. There is no agreed-upon method for measuring your current affect (there isn’t even an argument between Imperial and metric to be had).
Affect is an example of a latent variable – we have a pretty good idea that it exists and we have operationalized it, but there is no way that we can physically measure it. If we think of what latent tends to mean, hidden, we start to understand what latent variables are – variables that are hidden, but still interesting. Instead, we have to rely on a series of questions that gets us close to what we think is affect. This process of “getting at” affect through questions is what is known as measurement. You have been involved with measurement for a very long time, even if you did not know it. If you ever took a high-stakes standardized test (e.g., GRE, GMAT, LSAT), then you were directly involved in measurement – we can’t actually measure the mass of your mathematical reasoning ability, so we have to ask questions to measure that ability.
An important part of measurement concerns error. Error can come from many different sources (measure, person, environment, etc.) and “distorts” the observed score.
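A tiny simulation makes the distortion concrete: if observed = true + error, then noisier measurement weakens the relationship between the observed score and the truth (the sample sizes and error levels here are arbitrary choices for illustration):

```r
set.seed(1234)

# A "true" latent score for 1000 people
trueScore <- rnorm(1000)

# Observed scores = true score + measurement error of different magnitudes
observedClean <- trueScore + rnorm(1000, sd = 0.2)
observedNoisy <- trueScore + rnorm(1000, sd = 2.0)

# More error means the observed score tracks the truth less faithfully
cor(trueScore, observedClean)
cor(trueScore, observedNoisy)
```

This attenuation is exactly why measurement error matters: relationships between constructs look weaker than they really are when the measures are noisy.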
There might be a feeling of déjà vu creeping in here. You have already learned about a technique used for taking items and reducing them down – principal components analysis (PCA). The logical question is whether factor analysis and PCA are the same thing. They are both matrix factorization techniques, but the similarities start to end pretty quickly after that.
Causal direction: In PCA, the items are the cause of the component (think how ingredients combine to make a pancake – the pancake exists because the ingredients exist). In factor analysis, the latent factor gives rise to the items (i.e., the items reflect the construct – questions for mathematical reasoning exist because there is a construct for mathematical reasoning that exists beyond the existence of the items).
Component/Factor Interpretation: There is no interpretation of a component – it simply exists to reduce the variables. Factors, on the other hand, represent latent variables and are themselves meaningful.
Variance: PCA attempts to extract as much variance as possible from the items; each successive component extracts less variance than the one before it.
Measurement error: PCA assumes perfect measurement; factor analysis explicitly models measurement error.
The psych package will make life easy for us. Let’s see what these differences mean in terms of our output:
library(dplyr)
library(psych)
testData = bfi %>%
select(starts_with("A", ignore.case = FALSE),
starts_with("C"))
testPCA = principal(r = testData, nfactors = 2, rotate = "none")
testFA = fa(r = testData, nfactors = 2, rotate = "none")
testPCA

Principal Components Analysis
Call: principal(r = testData, nfactors = 2, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 h2 u2 com
A1 -0.32 -0.45 0.31 0.69 1.8
A2 0.59 0.48 0.59 0.41 1.9
A3 0.61 0.51 0.63 0.37 1.9
A4 0.56 0.26 0.38 0.62 1.4
A5 0.56 0.43 0.51 0.49 1.9
C1 0.49 -0.47 0.46 0.54 2.0
C2 0.58 -0.44 0.52 0.48 1.9
C3 0.55 -0.37 0.44 0.56 1.8
C4 -0.60 0.41 0.53 0.47 1.8
C5 -0.58 0.34 0.46 0.54 1.6
PC1 PC2
SS loadings 3.02 1.79
Proportion Var 0.30 0.18
Cumulative Var 0.30 0.48
Proportion Explained 0.63 0.37
Cumulative Proportion 0.63 1.00
Mean item complexity = 1.8
Test of the hypothesis that 2 components are sufficient.
The root mean square of the residuals (RMSR) is 0.09
with the empirical chi square 2262.93 with prob < 0
Fit based upon off diagonal values = 0.86
testFA

Factor Analysis using method = minres
Call: fa(r = testData, nfactors = 2, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 h2 u2 com
A1 -0.26 -0.29 0.15 0.85 2.0
A2 0.55 0.40 0.46 0.54 1.8
A3 0.59 0.47 0.56 0.44 1.9
A4 0.47 0.18 0.25 0.75 1.3
A5 0.51 0.34 0.37 0.63 1.7
C1 0.42 -0.36 0.31 0.69 2.0
C2 0.51 -0.37 0.40 0.60 1.8
C3 0.47 -0.29 0.31 0.69 1.7
C4 -0.53 0.37 0.42 0.58 1.8
C5 -0.51 0.29 0.34 0.66 1.6
MR1 MR2
SS loadings 2.40 1.18
Proportion Var 0.24 0.12
Cumulative Var 0.24 0.36
Proportion Explained 0.67 0.33
Cumulative Proportion 0.67 1.00
Mean item complexity = 1.8
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05
The harmonic number of observations is 2762 with the empirical chi square 397.07 with prob < 5e-68
The total number of observations was 2800 with Likelihood Chi Square = 468.37 with prob < 1.2e-82
Tucker Lewis Index of factoring reliability = 0.864
RMSEA index = 0.078 and the 90 % confidence intervals are 0.072 0.084
BIC = 261.99
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
MR1 MR2
Correlation of (regression) scores with factors 0.9 0.82
Multiple R square of scores with factors 0.8 0.67
Minimum correlation of possible factor scores 0.6 0.34
While items might load on the same factor/component, the magnitudes are quite different. We also see differences in communality (h2 – the proportion of an item’s variance accounted for by the factors/components) and item uniqueness (u2 – the variance that is unique to the item and not explained by the factors/components; u2 = 1 − h2). We can also see that our PCA extracted more variance from the items than did our EFA (.48 versus .36).
Determining the appropriate number of factors in an exploratory factor analysis can be complex. How many might theory dictate? Is there a strong theory at all? In an exploratory setting, it is helpful to conduct something called a parallel analysis (PA). PA is a Monte Carlo simulation that accounts for the number of items and rows in your data, generates many random matrices of the same shape, and then retains only those factors/components whose observed eigenvalues exceed the eigenvalues obtained from the random data.
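The core idea behind PA can be sketched in a few lines of base R. This is a simplified illustration with made-up data, not the psych implementation:

```r
set.seed(123)
# toy data: 10 items, the first 5 driven by a single common factor
n <- 300; k <- 10
f <- rnorm(n)
obs <- sapply(1:k, function(i) ifelse(i <= 5, 1, 0.2) * f + rnorm(n))
obsEigen <- eigen(cor(obs))$values
# average eigenvalues across 100 random matrices of the same shape
randEigen <- rowMeans(replicate(100,
  eigen(cor(matrix(rnorm(n * k), n, k)))$values))
# retain factors/components whose observed eigenvalue beats the random baseline
sum(obsEigen > randEigen)
```

Anything extracted beyond that count is likely just capitalizing on chance.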
psych::fa.parallel(testData)
Parallel analysis suggests that the number of factors = 4 and the number of components = 2
This is the most automated way of finding a potentially suitable number of factors. We are given some output appropriate for both factor analysis and principal components analysis, with a graphical representation provided. In this case, parallel analysis would suggest 4 factors to be retained for a factor analysis. If some theory exists, you can also use that information to guide your factor numbers.
While you will often get some different results, you can also use the nfactors function:
testData %>%
na.omit() %>%
cor() %>%
psych::nfactors()
Number of factors
Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm,
n.obs = n.obs, plot = FALSE, title = title, use = use, cor = cor)
VSS complexity 1 achieves a maximimum of 0.67 with 2 factors
VSS complexity 2 achieves a maximimum of 0.78 with 5 factors
The Velicer MAP achieves a minimum of 0.03 with 2 factors
Empirical BIC achieves a minimum of -53.47 with 3 factors
Sample Size adjusted BIC achieves a minimum of -8.49 with 4 factors
Statistics by number of factors
vss1 vss2 map dof chisq prob sqresid fit RMSEA BIC SABIC
1 0.54 0.00 0.046 35 8.4e+02 1.7e-154 7.4 0.54 0.152 602 712.9
2 0.67 0.73 0.034 26 1.6e+02 7.4e-22 4.3 0.73 0.073 -16 66.3
3 0.57 0.74 0.057 18 9.0e+01 1.4e-11 3.8 0.76 0.064 -34 23.0
4 0.62 0.77 0.083 11 3.3e+01 6.2e-04 3.1 0.81 0.045 -43 -8.5
5 0.61 0.78 0.118 5 1.3e+01 2.6e-02 2.5 0.85 0.040 -22 -6.0
6 0.55 0.76 0.185 0 2.0e-01 NA 2.3 0.86 NA NA NA
7 0.49 0.70 0.287 -4 2.6e-08 NA 2.5 0.84 NA NA NA
8 0.46 0.71 0.453 -7 1.5e-08 NA 2.5 0.84 NA NA NA
9 0.48 0.72 1.000 -9 0.0e+00 NA 2.5 0.84 NA NA NA
10 0.48 0.72 NA -10 0.0e+00 NA 2.5 0.84 NA NA NA
complex eChisq SRMR eCRMS eBIC
1 1.0 1.4e+03 1.2e-01 0.141 1154
2 1.1 1.4e+02 3.9e-02 0.051 -42
3 1.4 7.1e+01 2.8e-02 0.044 -53
4 1.3 2.4e+01 1.6e-02 0.033 -52
5 1.4 8.5e+00 9.7e-03 0.029 -26
6 1.6 1.2e-01 1.2e-03 NA NA
7 1.9 1.5e-08 4.1e-07 NA NA
8 1.9 8.6e-09 3.1e-07 NA NA
9 1.9 3.2e-17 1.9e-11 NA NA
10 1.9 3.2e-17 1.9e-11 NA NA
Now we are provided a variety of different metrics for determining the number of factors to retain. We can see that the suggested number of factors ranges from 2 (very simple structure with complexity 1, and the Velicer MAP) to 5 (very simple structure with complexity 2). Given the results of our parallel analysis and our sample size adjusted BIC, we could make a good argument for retaining 4 factors.
In some previous courses, you learned about PCA and eigenvalues/eigenvectors. You might remember how we rotate axes to try to get our components to fit together better. Generally, PCA forces an orthogonal rotation – in other words, the axes always stay at 90° to one another. Factor analysis allows for orthogonal rotations, but also oblique rotations (the angle between the axes can go above or below 90°). While pictures certainly help, there is something very important at play here. As you remember from PCA, an orthogonal rotation holds that the factors (or components) are not correlated. For components, this makes sense. For our scale, however, would you guess that the items being analyzed are not correlated? By using an oblique rotation, we are allowing the factors to be as correlated as necessary.
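To make the geometry concrete, here is a small base R sketch with toy loadings of my own. It shows that an orthogonal rotation is just multiplication by a rotation matrix, and that it leaves each item’s communality (h2) untouched:

```r
# hypothetical 4-item, 2-factor loading matrix
L <- matrix(c(0.7, 0.6, 0.1, 0.2,
              0.2, 0.1, 0.7, 0.6), ncol = 2)
theta <- pi / 6  # rotate the factor axes by 30 degrees
R <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), 2, 2)  # orthogonal rotation matrix
Lrot <- L %*% R
rowSums(L^2)     # communalities before rotation
rowSums(Lrot^2)  # identical after rotation
```

An oblique rotation replaces R with a transformation matrix whose columns are no longer at right angles to each other – that is exactly what allows the factors to correlate.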
Our rotations, though not terribly different, will produce different loading patterns.
orthRotation = fa(r = testData, nfactors = 2, rotate = "varimax")
obliqueRotation = fa(r = testData, nfactors = 2, rotate = "promax")
orthRotation
Factor Analysis using method = minres
Call: fa(r = testData, nfactors = 2, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 h2 u2 com
A1 0.01 -0.39 0.15 0.85 1.0
A2 0.12 0.67 0.46 0.54 1.1
A3 0.10 0.74 0.56 0.44 1.0
A4 0.22 0.45 0.25 0.75 1.4
A5 0.13 0.60 0.37 0.63 1.1
C1 0.55 0.03 0.31 0.69 1.0
C2 0.63 0.08 0.40 0.60 1.0
C3 0.54 0.11 0.31 0.69 1.1
C4 -0.64 -0.10 0.42 0.58 1.1
C5 -0.57 -0.14 0.34 0.66 1.1
MR1 MR2
SS loadings 1.81 1.77
Proportion Var 0.18 0.18
Cumulative Var 0.18 0.36
Proportion Explained 0.51 0.49
Cumulative Proportion 0.51 1.00
Mean item complexity = 1.1
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05
The harmonic number of observations is 2762 with the empirical chi square 397.07 with prob < 5e-68
The total number of observations was 2800 with Likelihood Chi Square = 468.37 with prob < 1.2e-82
Tucker Lewis Index of factoring reliability = 0.864
RMSEA index = 0.078 and the 90 % confidence intervals are 0.072 0.084
BIC = 261.99
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
MR1 MR2
Correlation of (regression) scores with factors 0.85 0.86
Multiple R square of scores with factors 0.72 0.75
Minimum correlation of possible factor scores 0.45 0.49
obliqueRotation
Factor Analysis using method = minres
Call: fa(r = testData, nfactors = 2, rotate = "promax")
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 h2 u2 com
A1 0.08 -0.41 0.15 0.85 1.1
A2 0.00 0.68 0.46 0.54 1.0
A3 -0.03 0.76 0.56 0.44 1.0
A4 0.14 0.44 0.25 0.75 1.2
A5 0.03 0.60 0.37 0.63 1.0
C1 0.57 -0.07 0.31 0.69 1.0
C2 0.64 -0.02 0.40 0.60 1.0
C3 0.54 0.02 0.31 0.69 1.0
C4 -0.65 0.01 0.42 0.58 1.0
C5 -0.56 -0.05 0.34 0.66 1.0
MR1 MR2
SS loadings 1.81 1.77
Proportion Var 0.18 0.18
Cumulative Var 0.18 0.36
Proportion Explained 0.51 0.49
Cumulative Proportion 0.51 1.00
With factor correlations of
MR1 MR2
MR1 1.00 0.34
MR2 0.34 1.00
Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05
The harmonic number of observations is 2762 with the empirical chi square 397.07 with prob < 5e-68
The total number of observations was 2800 with Likelihood Chi Square = 468.37 with prob < 1.2e-82
Tucker Lewis Index of factoring reliability = 0.864
RMSEA index = 0.078 and the 90 % confidence intervals are 0.072 0.084
BIC = 261.99
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
MR1 MR2
Correlation of (regression) scores with factors 0.86 0.88
Multiple R square of scores with factors 0.75 0.77
Minimum correlation of possible factor scores 0.49 0.54
We can see that our oblique rotation produces a correlation of .34 between the two factors. I will leave it up to you to determine if that is a strong enough correlation to warrant an oblique rotation. An orthogonal rotation, while ruling out correlation between the factors, does produce “cleaner” results. If the factors are not correlated, we can interpret each one in isolation. In an oblique rotation with correlated factors, we have to interpret the factors simultaneously and consider how items loading on multiple factors behave.
We can think of factor loadings as the correlation between an item and a factor – they are interpreted in this manner.
Let’s take a peek at the factor loadings of our oblique solution from above:
obliqueRotation$loadings
Loadings:
MR1 MR2
A1 -0.407
A2 0.678
A3 0.761
A4 0.144 0.437
A5 0.603
C1 0.574
C2 0.640
C3 0.544
C4 -0.652
C5 -0.564
MR1 MR2
SS loadings 1.806 1.768
Proportion Var 0.181 0.177
Cumulative Var 0.181 0.357
We can see that our output is hiding values of small magnitude (refer to the previous results for all values) – this is for ease in examining the loading matrix, and we could consider the hidden loadings negligible given their weakness. We can read these very much in the same way that we read correlations. For factor 1 (MR1 in the output), we see that items C1 through C5 load pretty strongly (with A4 having only a weak loading). So, someone who offers a high response on C5 likely has a lower level of whatever latent trait is being measured, while they likely have a higher level of the latent trait if they respond with a higher value on question C2.
We have gone through the necessary steps of performing our factor analysis to this point, but what do we ultimately get out of it? In the end, we want to take our factors and produce some type of score.
What kind of score should we produce? If we use the numeric values given by the observed variables (e.g., 1-5, 0-4), then we can imagine producing some type of aggregate score (i.e., a sum or an average score). If you ever took an undergraduate Psychology course, you have probably already done something like this:
testData = testData %>%
rowwise() %>%
mutate(agreeAverage = (sum(A1, A2, A3, A4, A5, na.rm = TRUE) / 5),
consAverage = (sum(C1, C2, C3, C4, C5, na.rm = TRUE) / 5))
Do those aggregate scores tell the complete truth? Remember how we just talked about loadings? The factor loadings are essentially telling us how much the variable is related to the factor. If we use a simple aggregate score, is that relationship captured? I am afraid not; aggregate scores give the same weight to every item, regardless of how highly it loads on any given factor. What about cross-loadings? This type of aggregate scoring would completely ignore them!
Instead, we can use factor scores. There are several different types of factor scores that we can compute, but we are going to compute Thurstone scores.
Thurstone scores are incredibly easy to produce by hand, so let’s give it a try.
To compute those scores, we need to produce a correlation matrix of our observed scores. For simplicity, let’s see what a one-factor model would look like.
agreeDat = bfi %>%
select(starts_with("A", ignore.case = FALSE))
agreeCorrs = cor(agreeDat, use = "pairwise")
Then, we need to get the loadings from our factor analysis results:
agreeFA = fa(agreeCorrs, rotate = "promax", scores = "regression")
agreeLoadings = agreeFA$loadings[1:5, 1]
Now, we have everything that we need: item loadings, item correlations, and observed scores.
The first step is to get the matrix product between the inverse correlation matrix and the factor loading matrix:
w = solve(agreeCorrs, agreeLoadings)
Finally, we center and scale the observed data and multiply it by the weight vector w that we just found:
agreeScaled = agreeDat %>%
scale(., center = TRUE, scale = TRUE)
facScores = agreeScaled %*% w
head(facScores)
            [,1]
61617 -0.8598383
61618 -0.2435214
61620 -0.4336288
61621 0.2284224
61622 -0.9461108
61623 0.3984620
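In matrix terms, the computation above is simply:

\[w = R^{-1}\lambda, \qquad \hat{f} = Zw\]

where \(R\) is the item correlation matrix, \(\lambda\) is the vector of factor loadings, and \(Z\) is the centered and scaled data.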
We can check these against those returned from our results:
head(predict(agreeFA, agreeDat))
            MR1
61617 -0.8598383
61618 -0.2435214
61620 -0.4336288
61621 0.2284224
61622 -0.9461108
61623 0.3984620
Now, let’s see how well those scores are correlated with simple aggregate scores.
simpleScores = agreeDat %>%
mutate(simpleScore = rowMeans(.))
cor(facScores[, 1], simpleScores$simpleScore, use = "complete")
[1] 0.8537173
We can see that the scores are certainly correlated, but the factor scores are likely closer to true scores.
Factor analysis falls under a broader notion of what is called Classical Testing Theory (CTT). In CTT, the notion of reliability is of supreme importance. Reliability is most easily defined as being able to produce the same result from a measure. If I give you the same measure a few months apart and you score similarly on both measures, then we are likely dealing with a reliable test. It is important to note that reliable is not the same as valid. Validity, with regard to measurement, is knowing that our measure is measuring what we intend it to measure (e.g., our measure of likelihood to engage in whistleblowing is actually measuring a person’s likelihood to blow the whistle). A measure can be reliable without being valid – remember it just has to produce the same results time after time.
Assessing reliability is generally an easy task. The simplest version of reliability is Cronbach’s measure of internal consistency, or \(\alpha\). Cronbach’s \(\alpha\) can be understood as the average of all possible split-half correlations within a measure (internal consistency means that the whole measure is reliable within itself).
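In terms of item variances, \(\alpha\) for a \(k\)-item scale is:

\[\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_i}{\sigma^2_{total}}\right)\]

where \(\sigma^2_i\) is the variance of item \(i\) and \(\sigma^2_{total}\) is the variance of the total score.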
Getting \(\alpha\) is easy enough:
psych::alpha(agreeDat, check.keys = TRUE)
Reliability analysis
Call: psych::alpha(x = agreeDat, check.keys = TRUE)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.7 0.71 0.68 0.33 2.5 0.009 4.7 0.9 0.34
lower alpha upper 95% confidence boundaries
0.69 0.7 0.72
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
A1- 0.72 0.73 0.67 0.40 2.6 0.0087 0.0065 0.38
A2 0.62 0.63 0.58 0.29 1.7 0.0119 0.0169 0.29
A3 0.60 0.61 0.56 0.28 1.6 0.0124 0.0094 0.32
A4 0.69 0.69 0.65 0.36 2.3 0.0098 0.0159 0.37
A5 0.64 0.66 0.61 0.32 1.9 0.0111 0.0126 0.34
Item statistics
n raw.r std.r r.cor r.drop mean sd
A1- 2784 0.58 0.57 0.38 0.31 4.6 1.4
A2 2773 0.73 0.75 0.67 0.56 4.8 1.2
A3 2774 0.76 0.77 0.71 0.59 4.6 1.3
A4 2781 0.65 0.63 0.47 0.39 4.7 1.5
A5 2784 0.69 0.70 0.60 0.49 4.6 1.3
Non missing response frequency for each item
1 2 3 4 5 6 miss
A1 0.33 0.29 0.14 0.12 0.08 0.03 0.01
A2 0.02 0.05 0.05 0.20 0.37 0.31 0.01
A3 0.03 0.06 0.07 0.20 0.36 0.27 0.01
A4 0.05 0.08 0.07 0.16 0.24 0.41 0.01
A5 0.02 0.07 0.09 0.22 0.35 0.25 0.01
We get a fair chunk of output, but let’s just focus on the “std.alpha” value of .71. By most “rules-of-thumb”, anything at or above .7 is acceptable for general purpose use. If we wanted to make very important decisions with our measure, we might look for something approaching or exceeding .9.
You likely noticed the “check.keys = TRUE” argument – \(\alpha\) depends on test length and inter-item correlations. In a case where some items are negatively correlated with the rest, that will greatly reduce our \(\alpha\) coefficient. Let’s try it for ourselves:
psych::alpha(agreeDat)
Some items ( A1 ) were negatively correlated with the total scale and
probably should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
Reliability analysis
Call: psych::alpha(x = agreeDat)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.43 0.46 0.53 0.15 0.85 0.016 4.2 0.74 0.32
lower alpha upper 95% confidence boundaries
0.4 0.43 0.46
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
A1 0.72 0.73 0.67 0.398 2.64 0.0087 0.0065 0.376
A2 0.28 0.30 0.39 0.097 0.43 0.0219 0.1098 0.081
A3 0.18 0.21 0.31 0.061 0.26 0.0249 0.1015 0.081
A4 0.25 0.31 0.44 0.099 0.44 0.0229 0.1607 0.105
A5 0.21 0.24 0.36 0.072 0.31 0.0238 0.1311 0.095
Item statistics
n raw.r std.r r.cor r.drop mean sd
A1 2784 0.066 0.024 -0.39 -0.31 2.4 1.4
A2 2773 0.630 0.666 0.58 0.37 4.8 1.2
A3 2774 0.724 0.742 0.72 0.48 4.6 1.3
A4 2781 0.686 0.661 0.50 0.37 4.7 1.5
A5 2784 0.700 0.719 0.64 0.45 4.6 1.3
Non missing response frequency for each item
1 2 3 4 5 6 miss
A1 0.33 0.29 0.14 0.12 0.08 0.03 0.01
A2 0.02 0.05 0.05 0.20 0.37 0.31 0.01
A3 0.03 0.06 0.07 0.20 0.36 0.27 0.01
A4 0.05 0.08 0.07 0.16 0.24 0.41 0.01
A5 0.02 0.07 0.09 0.22 0.35 0.25 0.01
Very different, right?
Let’s calculate it on our own. We will need to recode our negatively correlated variable (A1) first.
agreeDatRecode = agreeDat %>%
mutate(A1 = abs(A1-7)) # This recodes our scale direction
nvar = length(agreeDatRecode)
# This computes the covariance matrix of our data.
C = cov(agreeDatRecode, use = "complete.obs")
n = dim(C)[2]
total = rowMeans(agreeDatRecode, na.rm = TRUE)
# The tr function is just adding everything on the
# diagonal from our covariance matrix.
alpha.raw = (1 - psych::tr(C)/sum(C)) * (n/(n - 1))
How did we do? We perfectly replicated the raw alpha value from the output. This is an easy example, but more elaborate ones don’t take much more effort.
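To see the split-half interpretation of \(\alpha\) in action, here is a small simulation with toy data of my own. For a four-item scale, the variance-based formula for \(\alpha\) equals the average of all possible Flanagan-style split-half coefficients:

```r
set.seed(42)
# toy data: four items sharing one common factor
n <- 500
common <- rnorm(n)
items <- sapply(1:4, function(i) common + rnorm(n))
C <- cov(items)
k <- ncol(items)
alphaFormula <- (k / (k - 1)) * (1 - sum(diag(C)) / sum(C))
# the three ways to split four items into two halves
halves <- list(c(1, 2), c(1, 3), c(1, 4))
splitHalf <- sapply(halves, function(a) {
  b <- setdiff(1:4, a)
  ta <- rowSums(items[, a]); tb <- rowSums(items[, b])
  4 * cov(ta, tb) / var(ta + tb)  # Flanagan split-half coefficient
})
mean(splitHalf)  # matches alphaFormula
```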
We did all of this to say that \(\alpha\) helps us to know how consistent a one factor measure is within itself. In many circumstances where you find yourself with a unidimensional factor structure, \(\alpha\) will be more than sufficient for convincing people that your scale is reliable (weaknesses aside).
There are, however, many other forms of reliability (\(\alpha\) itself is a generalization of the Kuder-Richardson 20 for non-dichotomous items). One popular and conceptually helpful form of reliability is McDonald’s \(\omega\). McDonald’s \(\omega\) is well-suited for hierarchical factor structures, in which you might have various subscales that assess something bigger.
Earlier, we played with the bfi data. Let’s return to that, but with a hypothesis that the 5 factors of the bfi are actually subfactors of some larger latent personality variable.
bfiSubFactors = bfi %>%
select(-gender, -education, -age)
omegaTest = psych::omega(bfiSubFactors, nfactors = 5,
rotate = "promax", plot = FALSE)
omega.diagram(omegaTest, sl = FALSE)
This is what our factor structure would look like in a hierarchical fashion. We have the general factor (g) and our 5 subfactors (also called grouping factors). We can see how strongly our subfactors load onto our general factor with the provided values. We can also see how strongly each individual item loads onto a subfactor.
We can also look at some reliability values (and a cleaner loading matrix):
omegaTest
Omega
Call: psych::omega(m = bfiSubFactors, nfactors = 5, plot = FALSE, rotate = "promax")
Alpha: 0.82
G.6: 0.86
Omega Hierarchical: 0.52
Omega H asymptotic: 0.6
Omega Total 0.87
Schmid Leiman Factor loadings greater than 0.2
g F1* F2* F3* F4* F5* h2 u2 p2
A1- 0.40 0.19 0.81 0.09
A2 0.37 0.54 0.45 0.55 0.30
A3 0.43 0.55 0.52 0.48 0.35
A4 0.32 0.36 0.28 0.72 0.37
A5 0.45 0.23 0.44 0.46 0.54 0.45
C1 0.30 0.45 0.33 0.67 0.28
C2 0.33 0.56 0.45 0.55 0.24
C3 0.29 0.48 0.32 0.68 0.26
C4- 0.37 0.50 0.45 0.55 0.31
C5- 0.40 -0.21 0.45 0.43 0.57 0.38
E1- 0.36 0.44 0.35 0.65 0.38
E2- 0.51 0.50 0.54 0.46 0.48
E3 0.46 0.38 0.21 0.44 0.56 0.49
E4 0.51 0.46 0.53 0.47 0.48
E5 0.48 0.34 0.40 0.60 0.57
N1- 0.20 -0.78 0.22 0.65 0.35 0.06
N2- 0.20 -0.75 0.20 0.60 0.40 0.06
N3- 0.21 -0.71 0.55 0.45 0.08
N4- 0.37 -0.51 0.21 0.49 0.51 0.27
N5- 0.20 -0.51 0.35 0.65 0.11
O1 0.28 0.47 0.31 0.69 0.25
O2- 0.45 0.26 0.74 0.06
O3 0.34 0.20 0.55 0.46 0.54 0.25
O4- -0.36 0.25 0.75 0.01
O5- 0.53 0.30 0.70 0.07
With eigenvalues of:
g F1* F2* F3* F4* F5*
2.8 2.5 1.2 1.3 1.3 1.3
general/max 1.15 max/min = 2.1
mean percent general = 0.27 with sd = 0.16 and cv of 0.61
Explained Common Variance of the general factor = 0.27
Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 275 and the fit is 4.42
The number of observations was 2800 with Chi Square = 12321.03 with prob < 0
The root mean square of the residuals is 0.13
The df corrected root mean square of the residuals is 0.14
RMSEA index = 0.125 and the 10 % confidence intervals are 0.123 0.127
BIC = 10138.25
Measures of factor score adequacy
g F1* F2* F3* F4*
Correlation of scores with factors 0.76 0.92 0.69 0.76 0.80
Multiple R square of scores with factors 0.58 0.85 0.47 0.59 0.63
Minimum correlation of factor score estimates 0.17 0.70 -0.06 0.17 0.27
F5*
Correlation of scores with factors 0.81
Multiple R square of scores with factors 0.66
Minimum correlation of factor score estimates 0.32
Total, General and Subset omega for each subset
g F1* F2* F3* F4*
Omega total for total scores and subscales 0.87 0.84 0.77 0.73 0.70
Omega general for total scores and subscales 0.52 0.10 0.42 0.24 0.25
Omega group for total scores and subscales 0.25 0.75 0.35 0.50 0.45
F5*
Omega total for total scores and subscales 0.49
Omega general for total scores and subscales 0.12
Omega group for total scores and subscales 0.37
While we see that we can get an \(\alpha\) value, we also have Guttman’s \(\lambda_6\), and 3 different \(\omega\) values that are of interest to us. Since we are dealing with something of a hierarchical nature, we already know that \(\alpha\) and \(\lambda_6\) won’t be appropriate. \(\omega_h\) gives us an idea about the reliability of the general factor (perhaps not super in this case) and \(\omega_t\) gives us the total reliability of the test (very good).
We can also see item loadings for the general factor and each subfactor. Generally, we find that the items we would anticipate loading together do indeed load together (e.g., all N items load together, and all O items load together), along with some cross-loading items from other subfactors.
Last week, we learned about latent variables and how we can use factor analysis to construct measures for assessing latent traits. While we are still dealing in the world of latent variables, we are taking a new approach to measuring those latent variables – Item Response Theory (IRT). While both allow us to measure a person’s level of a latent trait, IRT gives us the power to understand the person and the items. We can recall that factor analysis falls under the broader heading of Classical Test Theory (CTT) – the name is a clear indication that the focus is on the test as a whole, not individual items.
In addition to this broadened understanding, IRT has a different focus from a measurement perspective than factor analysis. Perhaps most important is the notion of measurement error in IRT. In factor analysis, measurement error is assumed to be consistent across each item; this is not the case for IRT.
Before we dive into some of the more important parts of IRT, it is important to draw some distinctions between two broad families of models: dichotomous and polytomous. A dichotomous model can best be thought of in testing situations where a correct answer exists (you either get the question correct or not, so the outcome is dichotomous). With that knowledge, you might have already guessed what the polytomous model entails – items that have ordered responses with no clear correct choices (Likert response options would be a good example). Polytomous models include such models as the graded response model and the partial credit model.
We will see how these models behave differently as we continue discussing the different parts of IRT models.
In IRT, we talk about a person’s ability; essentially, it is the level of the latent variable that a person possesses. In the parlance of psychometrics, ability is denoted as \(\theta\). The \(\theta\) scale is standardized with a mean of 0 and typically ranges from -4 to 4 (while this is generally seen, it might not always be the case).
The ability to explore individual items within IRT is based upon three different parameters: location, discrimination, and pseudoguessing. We can incorporate any of those parameters into our model. First, though, we need to understand what those parameters mean.
In IRT, the location parameter (denoted as \(b_i\)) describes the difficulty of the item. If we are talking about a testing situation (e.g., a math or science test), the meaning of \(b_i\) is nearly self-evident – it is the point on the ability scale at which a person has a 50% chance of a correct response in a dichotomous model, and the point of moving from one response category to the next in polytomous models (the level of passing from one category to the next is called a threshold).
Let’s look at some examples from the mirt documentation of different item locations. The ltm package is also a great package for IRT models.
library(mirt)
library(ltm)
# The model argument set to 1 means we are just looking
# at a one-factor model.
fit1PL = mirt(LSAT, model = 1, itemtype = "Rasch", verbose = FALSE)
plot(fit1PL, type = 'trace', which.items = 1:5,
facet_items = FALSE)
The resulting plot is what is known as an Item Characteristic Curve (ICC). It contains a lot of information, but we are only going to pay attention to the location right now. Without any prior knowledge, which of the labelled curves would you say represents the most difficult item? If you said “3” (the black line), then you are absolutely correct. When we look at the location of an item in an ICC graph, we are looking for its placement along the x-axis at its middle point (i.e., find the middle of a curve and draw a line straight down – the farther to the right, the more difficult the item). With item 3, we can see that it takes someone with slightly more than average ability to have a better than chance probability of responding correctly.
Item discrimination, the \(a_i\) parameter, reflects how well an item differentiates between people with varying levels of ability – the steeper the curve, the stronger the discrimination.
Let’s specify a slightly different model and take a look at our ICC again:
fit2PL = mirt(LSAT, model = 1, itemtype = "2PL", verbose = FALSE)
plot(fit2PL, type = 'trace', which.items = 1:5,
facet_items = FALSE)
In our last model, we saw that all of the items had the exact same curve shape – this is no longer the case. If we pay attention to item 3 again, we see that a person with very low ability (\(\theta = -4\)) has nearly a 0 probability of responding correctly, someone with average ability (\(\theta = 0\)) has around a 50-50 chance of responding correctly, and high ability people (\(\theta = 4\)) have nearly a 100% chance of responding correctly. We would definitely say that item 3 discriminates between different ability levels pretty well (we will see if it is the most discriminating item later). Which item might not discriminate too well? Without doubt, you said item 1, and you would be correct. While people at lower ability levels have less chance of responding correctly to item 1, we can see that a probability near 1 is obtained pretty soon after crossing \(\theta = 0\). So while item 1 may provide a bit of discrimination at the lower end of \(\theta\), it does little for average ability and above. This provides even more evidence as to why item 3 is such a good item – it discriminates well along every point of \(\theta\).
So, we have a pretty good idea that item 3 is the most difficult item and the most discriminating item. This, hopefully, provides a demonstration on why IRT models are so powerful – we can compare individual items to people (on the same scales, no less).
The third parameter, \(c_i\), is the pseudoguessing parameter. Even the blind squirrel finds a nut, and \(c_i\) accounts for this. Essentially, the guessing parameter raises the lower asymptote of the curve to reflect the chance of guessing a correct answer.
Now that we know what our parameters mean, we can start putting them into models. The one parameter logistic model (1PL) only estimates item difficulty (\(b_i\)). It assumes that all items discriminate equally and that all items have the same guessing parameter.
Although there are some minor differences, you will often see the 1PL and the Rasch model used interchangeably.
The two parameter logistic model (2PL) adds the discrimination parameter, \(a_i\), into the model.
The three parameter logistic model (3PL) adds in the guessing parameter, \(c_i\). You might wonder how in the world a guessing parameter is added! If someone could guess the correct answer (think about a multiple choice test), would the probability of a correct answer ever be 0? I would think not. So, \(c_i\) just raises the lower-bound asymptote.
fit3PL = mirt(LSAT, model = 1, itemtype = "3PL", verbose = FALSE)
plot(fit3PL, type = 'trace', which.items = 1:5,
facet_items = FALSE)
If we compare this to the ICC plots we saw for the 1PL and 2PL earlier, we see a clear difference in the lower asymptote.
par(mfrow = c(1, 3))
plot(fit1PL, type = 'trace', which.items = 1:5,
facet_items = FALSE, main = "1PL")
plot(fit2PL, type = 'trace', which.items = 1:5,
facet_items = FALSE, main = "2PL")
plot(fit3PL, type = 'trace', which.items = 1:5,
facet_items = FALSE, main = "3PL")
While we see that the probability of a low-ability individual obtaining a correct answer on item 3 is still very low, it has increased slightly.
With all of these parameters, the probability of a correct response can be expressed as follows: \[p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)} }\]
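This formula translates directly into a few lines of R (the function name here is my own):

```r
# probability of a correct response under the 3PL model
# theta: ability, a: discrimination, b: difficulty, c: pseudoguessing
p3PL <- function(theta, a, b, c) {
  c + (1 - c) / (1 + exp(-a * (theta - b)))
}
p3PL(-4, a = 1.5, b = 0, c = 0.2)  # low ability: just above the asymptote c
p3PL(0,  a = 1.5, b = 0, c = 0.2)  # theta = b: halfway between c and 1, i.e., 0.6
p3PL(4,  a = 1.5, b = 0, c = 0.2)  # high ability: approaching 1
```

Setting \(c_i = 0\) recovers the 2PL, and additionally fixing \(a_i\) to a common value across items gives the 1PL.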
Scoring in IRT models works a bit differently than scoring in CTT.
An element common to both IRT and CTT is the true score: \(observed_{score} = true_{score} + e\). The distinction rests in how they treat \(e\) – CTT assumes that error is the same for everyone, while IRT allows error to vary freely across people.
IRT scores also have the added benefit of using the item response functions in the computation of scores.
We can see the factor scores returned from our model:
facScores = fscores(fit3PL)
head(facScores)
             F1
[1,] -1.847728
[2,] -1.847728
[3,] -1.847728
[4,] -1.444931
[5,] -1.444931
[6,] -1.444931
summary(facScores)
       F1
Min. :-1.8477281
1st Qu.:-0.4680632
Median : 0.0099887
Mean : 0.0000019
3rd Qu.: 0.6527025
Max. : 0.6527025
And let’s see what scores from a factor analysis would look like compared to these scores:
faScores = psych::fa(LSAT, 1, rotate = "promax")$scores
summary(faScores)
      MR1
Min. :-2.02900
1st Qu.:-0.44756
Median : 0.04501
Mean : 0.00000
3rd Qu.: 0.63895
Max. : 0.63895
cor(faScores, facScores)
           F1
MR1 0.9987849
plot(faScores, facScores)
We can see some very clear similarities and an undeniable correlation between our two sets of factor scores. While certainly similar, we can be reasonably confident that our IRT scores are closer to the true scores.
We can also put those scores back into our data to see how they correlate with the other items:
library(dplyr)
LSAT %>%
mutate(scores = facScores) %>%
cor() %>%
GGally::ggcorr()
We can see that overall scores are most correlated with item 3, followed by items 2 and 4. If we recall our ICC plot from above, this probably should not surprise us too much – item 3 is definitely the best item, while items 1 and 5 don’t look as nice.
So far, we have discussed models dealing with dichotomous variables (essentially incorrect versus correct responses). These are aptly named dichotomous models. However, we know that not all measures have binary responses. If we have multiple response options (perhaps a Likert response option format), then we need to use a polytomous model. One common model is the Graded Response Model (GRM).
The GRM is built expressly for ordinal, polytomous response options.
Let’s take a look:
library(psych)
library(dplyr)
testData = bfi %>%
dplyr::select(starts_with("C")) %>%
na.omit()
library(ggridges)
library(viridis)
library(ggplot2)
testData %>%
tidyr::gather(key = question, value = score) %>%
ggplot(., aes(score, question, fill = ..x..)) +
geom_density_ridges_gradient() +
scale_fill_viridis(option = "B") +
theme_minimal()
grmMod = mirt(testData, model = 1, itemtype = "graded", verbose = FALSE)
coef(grmMod, simplify = TRUE, IRTpars = TRUE)$items
a b1 b2 b3 b4 b5
C1 1.443 -3.157 -2.185 -1.419 -0.347 1.200
C2 1.605 -2.771 -1.725 -1.090 -0.154 1.254
C3 1.314 -3.189 -1.935 -1.232 -0.033 1.573
C4 -1.877 0.789 -0.251 -0.877 -1.721 -2.773
C5 -1.386 1.449 0.443 -0.051 -0.969 -2.034
$means
F1
0
$cov
F1
F1 1
In our coefficient output, we see threshold coefficients for each item (b1:b5). These detail the thresholds for moving from one response level to the next (we also see the discrimination value, a).
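To make those thresholds concrete, here is a small sketch (my own illustration, using the rounded a and b values reported for C1 above): each cumulative curve gives the probability of responding above a threshold, and differences between adjacent curves give the probability of landing in each response category.

```r
# Rounded GRM parameters for item C1, taken from the output above
a = 1.443
b = c(-3.157, -2.185, -1.419, -0.347, 1.200)

# Probability of each response category at a given ability (theta)
gradedProbs = function(theta, a, b) {
  pStar = c(1, plogis(a * (theta - b)), 0)  # P(response > threshold k)
  -diff(pStar)                              # adjacent differences: category probabilities
}

# For an average person (theta = 0), most of the mass sits in categories 4 and 5
round(gradedProbs(theta = 0, a = a, b = b), 3)
```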
Just like our dichotomous models, we can also see our item characteristic curves.
plot(grmMod, type = 'trace', which.items = 1:5,
facet_items = TRUE, main = "GRM")
We can see that these are very different looking ICC plots than what we saw before. Remember that we are dealing with different models here. Instead of the probability of incorrect or correct, we are now dealing with the probability of endorsing each response category. To that end, each item’s ICC has a curve for each response option over ability. If we look at the ICC for C1, we can see that someone offering a response of 1 on the scale would have a higher probability of having a lower ability. Conversely, someone offering a response of 6 would have a higher probability of being high ability.
We will need the following:
install.packages(c("lme4", "lmerTest", "merTools"))
While the specifics of each model we have learned to this point might take some time to get our heads all the way around, the terminology has largely been pretty clear – no more. You will hear “mixed models”, “mixed effects models”, “hierarchical linear models”, “nested models”, and/or “multilevel models”; these are all slight variations on a common theme. For the sake of our work here, we will stick with mixed models. Within our mixed model, we have an additional source of cloudiness: fixed and random effects. The random effects don’t pose much of an issue (we will define them later), but fixed effects have 4 different definitions depending upon whom you ask. For the sake of simplicity (again), we are going to consider fixed effects as an effect on the individual unit of analysis. This will all start to make sense once we take a look at the models.
For the sake of conceptual grounding, let’s go back to our standard linear model:
library(dplyr)
library(ggplot2)
healthData = readr::read_csv("https://www.nd.edu/~sberry5/data/healthViolationsDistances.csv")
healthData = healthData %>%
mutate(BORO = as.factor(.$BORO),
cuisine = as.factor(.$`CUISINE DESCRIPTION`),
distanceCentered = dohDistanceMeter - mean(dohDistanceMeter))
lmTest = lm(SCORE ~ distanceCentered, data = healthData)
ggplot(healthData, aes(SCORE, distanceCentered)) +
geom_point() +
geom_smooth(method = "lm")
summary(lmTest)
Call:
lm(formula = SCORE ~ distanceCentered, data = healthData)
Residuals:
Min 1Q Median 3Q Max
-22.249 -9.312 -3.146 5.755 95.787
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.104e+01 6.306e-02 333.615 < 2e-16 ***
distanceCentered -4.905e-05 1.354e-05 -3.622 0.000293 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.8 on 47907 degrees of freedom
Multiple R-squared: 0.0002737, Adjusted R-squared: 0.0002529
F-statistic: 13.12 on 1 and 47907 DF, p-value: 0.000293
We have our standard output here. Because we centered distance, our intercept is the average score at the mean distance between a restaurant and the department of health, and our coefficient for distance tells us that for every meter increase in distance, the score decreases by some tiny amount. We know that we are ignoring some information within our model, namely the clustering that occurs based upon cuisine and/or borough.
Let’s take a quick look at how each of the different boroughs might behave in a model:
ggplot(healthData, aes(SCORE, distanceCentered, group = BORO)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap( ~ BORO)
We can see that we have very different effects for each level of BORO.
Given what we saw in the preceding figure, mixed models will allow us to use the information for each level of BORO in the model.
Mixed models don’t get bogged down by large groups, and smaller groups do not get buried by larger ones (in linear models, you likely learned about balance in t-tests; mixed models attenuate the effects of group imbalance). Mixed models also will not overfit or underfit when we have repeated samples/measurements.
Let’s include borough in our model. We are not going to add it as another predictor, but we are going to include it as another level to our model. The lme4 package will make this very easy:
library(lme4)
riMod = lmer(SCORE ~ distanceCentered + (1|BORO), data = healthData)
Before we look at the summary for this model, let’s get an idea about what is happening in the syntax. The first part of our formula should look familiar – these are the global estimates (fixed effects) within our model and will behave exactly the same as in our standard linear model.
The next part, in the parentheses, is how we denote our random effect. Whenever you see a 1 included in a formula interface, we can be pretty comfortable that it is in reference to an intercept. The | specifies a grouping. With that information, we might be able to guess that we are specifying a random intercept for each borough.
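If the idea of a per-borough intercept feels abstract, we can pull those intercepts out directly once the model is fit (a quick sketch using the riMod object defined above):

```r
library(lme4)

# Deviation of each borough from the global (fixed) intercept
ranef(riMod)$BORO

# Global intercept plus deviation: each borough's own intercept
coef(riMod)$BORO
```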
We should probably check out the summary:
summary(riMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ distanceCentered + (1 | BORO)
Data: healthData
REML criterion at convergence: 387368.3
Scaled residuals:
Min 1Q Median 3Q Max
-1.6457 -0.7010 -0.2296 0.4065 6.9165
Random effects:
Groups Name Variance Std.Dev.
BORO (Intercept) 0.9879 0.9939
Residual 189.9733 13.7831
Number of obs: 47909, groups: BORO, 5
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.076e+01 4.592e-01 45.204
distanceCentered 1.078e-05 1.761e-05 0.612
Correlation of Fixed Effects:
(Intr)
distncCntrd -0.111
fit warnings:
Some predictor variables are on very different scales: consider rescaling
We have our standard output, and we can see that our coefficients have changed from our standard linear model – this change is largely due to the severe imbalance between our groups.
We can see the imbalance between the groups by using the group_by and summarize:
healthData %>%
group_by(BORO) %>%
summarize(n())
# A tibble: 5 x 2
BORO `n()`
<fct> <int>
1 BRONX 4083
2 BROOKLYN 11079
3 MANHATTAN 20250
4 QUEENS 11558
5 STATEN ISLAND 939
We can clearly see that we have large disparities between our groups.
When groups are largely balanced, we would find that our coefficients would be the same (or very close to it).
What should almost always change is our standard errors – by integrating information about the groups, we are getting a better sense of how much uncertainty our model contains at the global average level.
We also see some additional information – this is for our random effects. The standard deviation tells us how much the score moves around based upon borough after accounting for our fixed effects (i.e., the health score can move around nearly 1 whole point from borough alone). We can compare the standard deviation for BORO to the coefficient for distanceCentered – borough is contributing more variability to scores than distance. We can also add the variance components together and divide the random effect variance by that total to get its variance accounted for.
.9879 / (.9879 + 189.9733)
[1] 0.005173302
So while it might be doing more than what distance does, borough is not accounting for too much variance.
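Rather than typing the variance components by hand, we could wrap that computation in a small helper (a hypothetical convenience function, not part of lme4):

```r
# Proportion of total variance attributable to each component
# (suited to intercept-only random effects like riMod)
varianceExplained = function(model) {
  vc = as.data.frame(lme4::VarCorr(model))  # one row per variance component
  setNames(vc$vcov / sum(vc$vcov), vc$grp)
}

varianceExplained(riMod)  # BORO should land near the 0.005 computed above
```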
We can also plot simulated random effect ranges for each of the random effect groups. We want to pay attention to those that are highlighted with black (i.e., the range does not cross the red line at 0).
library(merTools)
plotREsim(REsim(riMod))
levels(healthData$BORO)
[1] "BRONX" "BROOKLYN" "MANHATTAN" "QUEENS"
[5] "STATEN ISLAND"
In examining the plot, we see that the random effect ranges for the Bronx and Staten Island have significant effects on health score.
Going back to the output, did you notice anything missing? P-values! Estimating p-values in a mixed model is exceedingly difficult because of varying group sizes, the complete sample n, and how those relate to reference distributions. If you need something to lean on, you can get confidence intervals in the same way that you would for anything else:
confint(riMod)
                        2.5 %       97.5 %
.sig01 5.025929e-01 1.961607e+00
.sigma 1.369612e+01 1.387068e+01
(Intercept) 1.978061e+01 2.174178e+01
distanceCentered -2.375822e-05 4.505094e-05
If you really want to see p-values, you can get them easily:
riModP = lmerTest::lmer(SCORE ~ distanceCentered + (1|BORO), data = healthData)
summary(riModP)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: SCORE ~ distanceCentered + (1 | BORO)
Data: healthData
REML criterion at convergence: 387368.3
Scaled residuals:
Min 1Q Median 3Q Max
-1.6457 -0.7010 -0.2296 0.4065 6.9165
Random effects:
Groups Name Variance Std.Dev.
BORO (Intercept) 0.9879 0.9939
Residual 189.9733 13.7831
Number of obs: 47909, groups: BORO, 5
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.076e+01 4.592e-01 4.133e+00 45.204 9.96e-07 ***
distanceCentered 1.078e-05 1.761e-05 3.455e+03 0.612 0.54
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
distncCntrd -0.111
fit warnings:
Some predictor variables are on very different scales: consider rescaling
NOTE: I would never load the lmerTest package, but would attach with colons! If you load it, you will find that it masks things from lme4 that you don’t want to have masked (i.e., lmer) and they are not equivalent objects!
Getting predicted values from our mixed model is no different than getting them from any other model:
mixedPred = predict(riMod)
slimPred = predict(lmTest)
head(cbind(actual = healthData$SCORE,
mixed = mixedPred,
slim = slimPred), 20)
   actual    mixed     slim
1 18 21.40671 20.67657
2 39 21.33549 21.00055
3 12 21.38798 20.28845
4 28 21.34376 20.48960
5 10 21.28248 21.24163
6 4 21.32554 21.04579
7 9 19.45224 20.67354
8 18 21.68792 21.12855
9 25 21.65763 21.26634
10 19 20.17974 20.97907
11 7 20.21237 20.83064
12 17 20.10837 21.30368
13 10 20.12837 21.21274
14 9 20.12701 21.21891
15 38 20.19149 20.92563
16 19 21.68716 21.13201
17 11 21.66144 21.24901
18 11 21.70010 21.07315
19 18 21.66504 21.23266
20 17 21.33218 21.01559
While there were cases where the standard linear model did a slightly better job, our mixed model generally did a better job (even if marginally so).
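Instead of eyeballing 20 rows, we could summarize the comparison with a quick root mean square error; this sketch assumes the mixedPred and slimPred objects created above.

```r
# Lower RMSE means predictions sit closer to the actual scores
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))

c(slim = rmse(healthData$SCORE, slimPred),
  mixed = rmse(healthData$SCORE, mixedPred))
```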
Let’s add some more information to our model. As we dive into our data, we will notice that we also have cuisine groupings. We can add this additional grouping into our model:
clusterMod = lmer(SCORE ~ distanceCentered + (1|cuisine) + (1|BORO), data = healthData)
This is often called a cross-classified model.
summary(clusterMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ distanceCentered + (1 | cuisine) + (1 | BORO)
Data: healthData
REML criterion at convergence: 386632.7
Scaled residuals:
Min 1Q Median 3Q Max
-1.8661 -0.6856 -0.2399 0.4194 6.8509
Random effects:
Groups Name Variance Std.Dev.
cuisine (Intercept) 2.901 1.703
BORO (Intercept) 1.867 1.366
Residual 186.933 13.672
Number of obs: 47909, groups: cuisine, 12; BORO, 5
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.132e+01 8.116e-01 26.269
distanceCentered -2.450e-05 1.791e-05 -1.368
Correlation of Fixed Effects:
(Intr)
distncCntrd -0.067
fit warnings:
Some predictor variables are on very different scales: consider rescaling
Let’s look at our variances the way we did earlier:
# cuisine
2.90 / (2.90 + 1.867 + 186.933)
[1] 0.0151278
# BORO
1.867 / (1.867 + 2.90 + 186.933)
[1] 0.009739176
plotREsim(REsim(clusterMod))
We should also see if our predictions improved:
clusterPredict = predict(clusterMod)
head(cbind(actual = healthData$SCORE,
clustered = clusterPredict,
mixed = mixedPred,
slim = slimPred), 20)
   actual clustered    mixed     slim
1 18 19.28713 21.40671 20.67657
2 39 19.44896 21.33549 21.00055
3 12 20.07550 21.38798 20.28845
4 28 20.17598 21.34376 20.48960
5 10 23.67651 21.28248 21.24163
6 4 19.47156 21.32554 21.04579
7 9 17.25207 19.45224 20.67354
8 18 24.36511 21.68792 21.12855
9 25 20.32682 21.65763 21.26634
10 19 22.06295 20.17974 20.97907
11 7 17.88168 20.21237 20.83064
12 17 21.14971 20.10837 21.30368
13 10 21.93627 20.12837 21.21274
14 9 22.18274 20.12701 21.21891
15 38 22.03625 20.19149 20.92563
16 19 24.36684 21.68716 21.13201
17 11 20.31816 21.66144 21.24901
18 11 25.39586 21.70010 21.07315
19 18 20.30999 21.66504 21.23266
20 17 19.45647 21.33218 21.01559
For many observations, our predictions definitely got tighter (many are still far off, though).
If we continue to look at our data (and with some knowledge about how NYC does health inspections), we will see that restaurants are rated yearly – let’s use this information in our model. We won’t worry about distance anymore, because now we have a few competing hypotheses. We could imagine two different ways that this works: one in which a restaurant’s score improves as observations increase (it takes some time for the owner to get the staff fully up to speed) or one in which the score decreases as the observations increase (the “shine has worn off the penny”).
Let’s do a bit of data processing first.
healthDataGrouped = healthData %>%
tidyr::unite(col = nameLocation, DBA, BUILDING , remove = FALSE) %>%
group_by(nameLocation) %>%
arrange(lubridate::mdy(`GRADE DATE`)) %>%
mutate(observation = 1:n())
timeReviewed = healthDataGrouped %>%
summarize(n = n()) %>%
filter(n > 10)
reviewedRest = healthDataGrouped[which(healthDataGrouped$nameLocation %in%
timeReviewed$nameLocation), ]
observationMod = lmer(SCORE ~ observation + (1|nameLocation), data = reviewedRest)
In this model, we have a fixed effect for observation and we are allowing each location to have its own random intercept. We have essentially created a model that will deal with the repeated measures for each of the locations.
reviewedRest %>%
arrange(nameLocation, observation) %>%
head(., 350) %>%
ggplot(., aes(observation, SCORE, group = nameLocation)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap( ~ nameLocation) +
theme_minimal()
summary(observationMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ observation + (1 | nameLocation)
Data: reviewedRest
REML criterion at convergence: 345661
Scaled residuals:
Min 1Q Median 3Q Max
-5.0067 -0.5765 -0.1311 0.4606 6.6361
Random effects:
Groups Name Variance Std.Dev.
nameLocation (Intercept) 29.46 5.428
Residual 125.82 11.217
Number of obs: 44622, groups: nameLocation, 1750
Fixed effects:
Estimate Std. Error t value
(Intercept) 13.948018 0.159359 87.53
observation 0.461932 0.005331 86.65
Correlation of Fixed Effects:
(Intr)
observation -0.455
Our fixed effect here would indicate that we have a slight increase in scores as our observations increase, but we can also see that scores will bounce around about 5 points on average by location alone.
29.46 / (29.46 + 125.82)
[1] 0.1897218
We see that the location alone accounts for nearly 20% of the variance in health scores.
Hierarchical models are a slight variation on the models that we have just seen. In these models, we have groups nested within other groups. We know that we have a “BORO” variable, but a quick look at our data will show that we also have a DBA variable; this variable gives the name of the restaurant. One source of inquiry would be to inspect how various chain restaurants, nested within each borough, perform.
Let’s check out how some chain restaurants do within the boroughs.
chainFood = healthDataGrouped %>%
filter(DBA == "BURGER KING" |
DBA == "MCDONALD'S" |
DBA == "PIZZA HUT" |
DBA == "SUBWAY")
The model set-up is just different enough to cause some potential confusion here. Within the parentheses, we have our intercept as before, but we are also saying that we are looking at the DBA groups within the BORO groups.
hierMod = lme4::lmer(SCORE ~ observation + (1|BORO/DBA),
data = chainFood)
summary(hierMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ observation + (1 | BORO/DBA)
Data: chainFood
REML criterion at convergence: 1317.5
Scaled residuals:
Min 1Q Median 3Q Max
-1.9087 -0.5211 -0.1990 0.1679 3.7629
Random effects:
Groups Name Variance Std.Dev.
DBA:BORO (Intercept) 11.076 3.328
BORO (Intercept) 1.877 1.370
Residual 168.917 12.997
Number of obs: 165, groups: DBA:BORO, 9; BORO, 5
Fixed effects:
Estimate Std. Error t value
(Intercept) 7.5428 2.2741 3.317
observation 1.2320 0.2306 5.342
Correlation of Fixed Effects:
(Intr)
observation -0.609
We have our fixed effect for observation (indicating that there is a positive relationship between observations and scores) and we have individual intercepts for both BORO and DBA nested within BORO. Looking at the standard deviation of our random effects, we can see that the scores bounce around a little over 3 points from chain to chain. If we explore the variance components, we see that around 6% (11.076 / (11.076 + 1.877 + 168.91)) of the variance in score is handled within our nested random effect.
We can also look at our effect ranges:
plotREsim(REsim(hierMod))
Even though it does not look like there is too much going on in the way of effects here, we still need to know the group(s) that each dot represent(s):
unique(hierMod@flist$BORO)
[1] QUEENS BROOKLYN MANHATTAN STATEN ISLAND BRONX
Levels: BRONX BROOKLYN MANHATTAN QUEENS STATEN ISLAND
unique(hierMod@flist$`DBA:BORO`)
[1] MCDONALD'S:QUEENS SUBWAY:BROOKLYN
[3] MCDONALD'S:MANHATTAN PIZZA HUT:MANHATTAN
[5] MCDONALD'S:BROOKLYN PIZZA HUT:QUEENS
[7] MCDONALD'S:STATEN ISLAND BURGER KING:BROOKLYN
[9] MCDONALD'S:BRONX
9 Levels: BURGER KING:BROOKLYN MCDONALD'S:BRONX ... SUBWAY:BROOKLYN
They appear in the same order in the graph as they do printed out, so we can just map them on. While the interval is large and certainly crosses 0, it seems like the effect for McDonald’s in the Bronx is sizeable.
Now that we have seen random intercepts and hierarchical models, we can add one final piece: random slopes. In the following model, we will specify a random intercept (recall, it is the 1 within our parentheses) and a random slope (we prefix our grouping variable with the slope of interest). Not only will this model allow the intercept to vary between groups, but it will also allow the slope to vary between groups.
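As a syntax aside, lme4 also allows a random slope without estimating the intercept-slope correlation; here is a sketch of the two common specifications side by side (the double-bar model is shown only for comparison, not used afterwards):

```r
library(lme4)

# Correlated random intercept and slope
corMod = lmer(SCORE ~ observation + (1 + observation | nameLocation),
              data = reviewedRest)

# Uncorrelated random intercept and slope (note the double bar)
uncorMod = lmer(SCORE ~ observation + (1 + observation || nameLocation),
                data = reviewedRest)
```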
observationMod = lmer(SCORE ~ observation + (1 + observation|nameLocation), data = reviewedRest)
summary(observationMod)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ observation + (1 + observation | nameLocation)
Data: reviewedRest
REML criterion at convergence: 342130.5
Scaled residuals:
Min 1Q Median 3Q Max
-5.3899 -0.5413 -0.0949 0.4448 7.6038
Random effects:
Groups Name Variance Std.Dev. Corr
nameLocation (Intercept) 40.3173 6.3496
observation 0.2546 0.5046 -0.56
Residual 109.6194 10.4699
Number of obs: 44622, groups: nameLocation, 1750
Fixed effects:
Estimate Std. Error t value
(Intercept) 12.42496 0.18418 67.46
observation 0.63182 0.01486 42.51
Correlation of Fixed Effects:
(Intr)
observation -0.648
convergence code: 0
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
Nothing changes with regard to our fixed effects, but we get some added information in our random effects. The random intercept variance for each location is telling us the amount that the first score bounces around from place to place (a pretty massive amount, if you ask me) and the observation variance is telling us how much variability there is within the slope between locations.
If we use the predicted values of our model, we can see what our results will look like over the observations:
randEffPlot = reviewedRest %>%
ungroup() %>%
mutate(mixedPrediction = predict(observationMod)) %>%
group_by(nameLocation) %>%
ggplot(., aes(observation, mixedPrediction, group = nameLocation)) +
geom_line(alpha = .5) +
theme_minimal()
randEffPlot
Amazing! We can really see the varying intercepts for each restaurant and we can see how the slopes differ (some are similar, but we have a complete mixed bag of directions and magnitudes).
For the sake of demonstration, and hopefully as a way to bring everything together, let’s see what predictions from a standard regression might look like compared to our mixed model:
slimPred = predict(lm(SCORE ~ observation, data = reviewedRest))
reviewedRest = reviewedRest %>%
ungroup() %>%
mutate(slimPrediction = slimPred)
randEffPlot +
geom_line(data = reviewedRest, aes(observation, slimPrediction),
color = "#ff5500", size = 2)
It looks like the standard regression line would do okay for many of our locations, but we can see that it would do poorly with many others. This should help to provide an illustration of just how flexible and powerful mixed models can be in your hands.
Data generated by people has missing values – this is an unfortunate truth. While shying away from the truth might be someone’s idea of a good time, missing data provides us with some interesting issues to tackle.
The vast majority of statistical techniques are not robust to missingness. Take our standard linear model (lm):
library(dplyr)
library(magrittr)
testData = read.csv("missingSurvey.csv")
testData %>%
lm(average_weekly_hours ~ EMP_Engagement_5, data = .) %>%
summary()
Call:
lm(formula = average_weekly_hours ~ EMP_Engagement_5, data = .)
Residuals:
Min 1Q Median 3Q Max
-14.4551 -5.5699 0.2005 5.3153 31.4301
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.9143 0.2439 73.46 <2e-16 ***
EMP_Engagement_5 2.8852 0.0913 31.60 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.031 on 13743 degrees of freedom
(1254 observations deleted due to missingness)
Multiple R-squared: 0.06775, Adjusted R-squared: 0.06768
F-statistic: 998.7 on 1 and 13743 DF, p-value: < 2.2e-16
Do you see any discrepancies? If we look at the number of rows within our data, we see a value of 14999 – a very decent sample for sure. If, however, we take a look at the degrees of freedom from our model output, we see that a considerable number of rows were dropped (the output even gives us a message letting us know how many). A linear regression cannot use missing information, so a whole row gets dropped if any missingness exists within the variables being used. If an observation or a dozen get dropped out of a few hundred rows, we might not have too much to worry about. What about if more than half of the observations get dropped? That is probably not a good thing.
When we talk about missing data, it is easy to assume that it is just missing: nothing more and nothing less. What if we knew why those values were missing? Would that change your thoughts on missingness? What if you knew that certain people did not respond to certain questions (for example, people who do not identify with any particular religion may leave a question about religious strength blank)? Just thinking about such an example, would grouping that kind of missingness with someone who skipped a question purely through omission make sense? You would likely say that they are different, and you would be correct. Generally, we can think of three different types of missing values: missing completely at random, missing at random, and missing not at random.
Data missing completely at random (MCAR), is exactly how it sounds – the missing data is not related to any study variables and it is not related to any unknown process or parameter about the person generating the data. We can assume that MCAR data is not biased.
Data missing at random (MAR) is a bit different and a little trickier. MAR data, despite the name, assumes that the data is not really missing at random, but that the missingness is fully explained by some other observed variable, not by the missing values themselves.
Missing not at random (MNAR) extends this further: the missingness is related to the unobserved value itself.
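A toy simulation (entirely made-up data, just for intuition) can show how the three mechanisms differ in what drives the missingness:

```r
set.seed(1001)
n = 1000
age = rnorm(n, mean = 45, sd = 10)
drinks = rpois(n, lambda = 3)

# MCAR: a coin flip decides the missingness
drinksMCAR = ifelse(runif(n) < .1, NA, drinks)

# MAR: missingness depends on another observed variable (age)
drinksMAR = ifelse(runif(n) < plogis(as.numeric(scale(age))), NA, drinks)

# MNAR: missingness depends on the unobserved value itself
drinksMNAR = ifelse(runif(n) < plogis(drinks - 4), NA, drinks)
```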
Determining which type of missingness is present within your data can be tricky. In the most ideal world, we would have the data to know whether it is MCAR or something else. I think you can see the circular nature that we are getting ourselves in!
There are, however, some methods that will let us test our missingness. The MissMech package will perform some tests that can help.
library(MissMech)
library(dplyr)
mcarTest = TestMCARNormality(testData)
mcarTest
Call:
TestMCARNormality(data = testData)
Number of Patterns: 7
Total number of cases used in the analysis: 14999
Pattern(s) used:
EMP_Engagement_1 EMP_Engagement_2 EMP_Engagement_3
group.1 1 1 1
group.2 1 1 1
group.3 1 1 1
group.4 1 NA 1
group.5 1 1 1
group.6 NA 1 1
group.7 1 1 NA
EMP_Engagement_4 EMP_Engagement_5 average_weekly_hours
group.1 1 1 NA
group.2 1 1 1
group.3 NA 1 1
group.4 1 1 1
group.5 1 NA 1
group.6 1 1 1
group.7 1 1 1
Number of cases
group.1 618
group.2 11214
group.3 635
group.4 658
group.5 636
group.6 618
group.7 620
Test of normality and Homoscedasticity:
-------------------------------------------
Hawkins Test:
P-value for the Hawkins test of normality and homoscedasticity: 7.09661e-304
Either the test of multivariate normality or homoscedasticity (or both) is rejected.
Provided that normality can be assumed, the hypothesis of MCAR is
rejected at 0.05 significance level.
Non-Parametric Test:
P-value for the non-parametric test of homoscedasticity: 1.413518e-08
Hypothesis of MCAR is rejected at 0.05 significance level.
The multivariate normality test is inconclusive.
In our output, we get a glimpse at what missing patterns exist. We can see that there are 11214 complete cases (group.2 has a 1 for every variable). If we look at the Hawkins test for multivariate normality and homoscedasticity, we have to reject multivariate normality and/or homoscedasticity (i.e., we are rejecting the hypothesis that our data is normal and homoscedastic – not surprising, since this is Likert-flavored survey data). Since multivariate normality is out of the picture, we cannot rely on the Hawkins test alone (if we could, it would indicate that we need to reject our MCAR assumption), so we go down to the non-parametric test. Looking at the non-parametric test, we see that the hypothesis of MCAR is rejected there as well.
Had our results been different, we would have gotten a message stating, “There is not sufficient evidence to reject normality or MCAR at 0.05 significance level.”.
The language here is important – notice that there was very little in the way of absolutes because we can never be 100% confident that our data is missing completely at random.
The TestMCARNormality function will only take numeric variables, so be sure to keep that in mind when using it.
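A quick way to guard against that (sketched with dplyr, assuming a version recent enough to support where()) is to keep only the numeric columns before passing the data along:

```r
library(dplyr)

numericOnly = testData %>%
  select(where(is.numeric))

mcarTest = MissMech::TestMCARNormality(numericOnly)
```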
If we can establish MCAR, then we can proceed with some imputation without fear of introducing a terrible amount of bias. If we cannot establish that our missing data was missing completely at random, then our multiply imputed data might have some issues.
Many types of single-value imputation exist: mean imputation, median imputation, among others. These central tendency-based imputation methods are classical holdovers. In essence, they replace a missing value with the mean/median of the column.
Here is a brief demonstration of how this might work:
testData %>%
mutate(average_weekly_hours = ifelse(is.na(average_weekly_hours) == TRUE,
mean(average_weekly_hours, na.rm = TRUE),
average_weekly_hours),
EMP_Engagement_5 = ifelse(is.na(EMP_Engagement_5),
mean(EMP_Engagement_5, na.rm = TRUE),
EMP_Engagement_5)) %>%
lm(average_weekly_hours ~ EMP_Engagement_5, data = .) %>%
summary()
Call:
lm(formula = average_weekly_hours ~ EMP_Engagement_5, data = .)
Residuals:
Min 1Q Median 3Q Max
-14.283 -5.516 -0.378 5.251 33.622
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.2144 0.2344 77.69 <2e-16 ***
EMP_Engagement_5 2.7672 0.0879 31.48 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.912 on 14997 degrees of freedom
Multiple R-squared: 0.06199, Adjusted R-squared: 0.06193
F-statistic: 991.1 on 1 and 14997 DF, p-value: < 2.2e-16
Our missing values are fixed, so all is well. In reality, we are making a pretty big logical leap in assuming that those missing values would be the mean. It might not be the worst assumption that we could make (if we knew that this was the actual population average, it might be slightly more reasonable), but it is still tenuous.
When statistics were done on punchcards, this was probably about the best that could be done. With greater computing power and more advanced statistical methods, we can get much better imputation values.
Multiple imputation takes those simple imputation techniques much further. If we are using multivariate imputation by chained equations (mice), then we are going through a lengthy process to estimate the missing values and using several imputed datasets for our models. Data imputed with mice uses predictions from every variable around the missing data to predict the value of the missing data.
We are only interested in imputing our variable with missingness, so we will specify the “cart” method (classification and regression trees) for our engagement questions and predictive mean matching for our hours worked. There are many options depending upon your exact needs (i.e., data types), so spend some time looking at the built in imputation methods in mice. The “cart” and “pmm” methods were used here purely for flexibility.
library(mice)
imputedData = mice(testData, m = 10, maxit = 20,
method = c("cart", "cart", "cart",
"cart", "cart", "pmm"),
pred = quickpred(testData, minpuc = .2, mincor = .01),
print = FALSE)
imputedData
Class: mids
Number of multiple imputations: 10
Imputation methods:
EMP_Engagement_1 EMP_Engagement_2 EMP_Engagement_3
"cart" "cart" "cart"
EMP_Engagement_4 EMP_Engagement_5 average_weekly_hours
"cart" "cart" "pmm"
PredictorMatrix:
EMP_Engagement_1 EMP_Engagement_2 EMP_Engagement_3
EMP_Engagement_1 0 0 0
EMP_Engagement_2 0 0 0
EMP_Engagement_3 0 0 0
EMP_Engagement_4 1 1 1
EMP_Engagement_5 1 1 0
average_weekly_hours 1 0 1
EMP_Engagement_4 EMP_Engagement_5
EMP_Engagement_1 0 0
EMP_Engagement_2 1 1
EMP_Engagement_3 1 0
EMP_Engagement_4 0 0
EMP_Engagement_5 0 0
average_weekly_hours 0 1
average_weekly_hours
EMP_Engagement_1 0
EMP_Engagement_2 1
EMP_Engagement_3 1
EMP_Engagement_4 1
EMP_Engagement_5 1
average_weekly_hours 0
plot(imputedData, layout = c(2, 1))
Will you look at that plot! These are trace lines. They track each imputation model over each iteration. What do we take out of these? If they look like fuzzy caterpillars, the chained equation process was good. If there is any discernible pattern, then something odd might have happened – these look just as crazy as they should! The mean for our imputed variables generally bounced around between 2.45 and 2.65. Given the observed mean, this seems pretty good.
The chained equations that we are dealing with in mice are very much akin to Markov Chain Monte Carlo methods.
Here is a conceptual example of Markov chains! This example is adapted from Richard McElreath’s excellent book, Statistical Rethinking.
You manage 10 teams, named from 1 to 10. Each team’s name corresponds to the number of people on the team, and each team is on a separate floor in the building.
You need to visit a team everyday, proportional to the number of people on the team (i.e., you would visit team 10 more often than team 1).
At the end of the day, you randomly select whether a proposed move will take you up a floor or down a floor.
After randomly selecting up or down, you grab a number of pens that is equal to the number of the current team (for team 10, you would grab 10 pens).
Next, you would grab a number of pencils corresponding to the proposed move (your randomly selected proposal would take you up a floor, so starting back at the bottom with team 1). So you would have 10 pens and 1 pencil.
If you have more pencils than pens, you will always move to the proposed floor.
If you have more pens than pencils, you set down a number of pens equal to the number of pencils and put them in a drawer.
You reach back into the drawer to randomly select a pen or a pencil.
Your selection decides where you go – pen is stay and pencil is go!
You should play with the following function (uncomment the line, play with starting values, etc.):
markovSim = function(daysRun, startDay) {
  position = rep(0, daysRun)
  current = startDay
  for(i in 1:daysRun) {
    position[i] = current
    proposal = current + sample(c(-1, 1), size = 1)
    if(proposal < 1) proposal = 10
    if(proposal > 10) proposal = 1
    probabilityMove = proposal / current
    current = ifelse(runif(1) < probabilityMove, proposal, current)
    # print(paste(position[i], proposal, probabilityMove, current, sep = " -> "))
  }
  return(position)
}
test1 = markovSim(1000, 5)
test2 = markovSim(1000, 6)
library(ggplot2)
ggplot() +
geom_line(data = as.data.frame(test1), aes(1:length(test1), test1),
color = "#ff5500", size = .75) +
geom_line(data = as.data.frame(test2), aes(1:length(test2), test2),
color = "black", size = .75) +
theme_minimal()
Does that plot look at all familiar? It sure does look like a Pyrrharctia isabella larva to me. Beyond its similarity to a woolly bear caterpillar, this is a perfect illustration of a Markov chain.
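To convince ourselves that the chain really does visit teams in proportion to their size, we can run the simulator for a long stretch of days and compare the visit frequencies against the target proportions (the function is repeated here so the snippet stands alone):

```r
# Metropolis-style floor-visiting simulator (same logic as markovSim above)
markovSim = function(daysRun, startDay) {
  position = rep(0, daysRun)
  current = startDay
  for(i in 1:daysRun) {
    position[i] = current
    proposal = current + sample(c(-1, 1), size = 1)  # propose up or down a floor
    if(proposal < 1) proposal = 10   # wrap around the bottom
    if(proposal > 10) proposal = 1   # wrap around the top
    probabilityMove = proposal / current  # pencils / pens
    current = ifelse(runif(1) < probabilityMove, proposal, current)
  }
  position
}

set.seed(1001)
visits = markovSim(100000, 5)

# proportion of days spent with each team vs. the target k / 55
observed = as.vector(table(factor(visits, levels = 1:10))) / length(visits)
target = (1:10) / sum(1:10)
round(cbind(observed, target), 3)
```

The observed proportions track the target closely: the chain spends about ten times as many days with team 10 as with team 1, which is exactly the long-run behavior we asked for.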
As fun as that was, let’s get back to the matter at hand.
Now we have 10 imputed data sets, each produced by 20 iterations of the chained equations, on which we can perform separate analyses.
fit1 = with(data = imputedData,
            expr = lm(average_weekly_hours ~ EMP_Engagement_5))
We can check on each individual imputed data set:
summary(fit1)
# A tibble: 20 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 17.9 0.235 76.4 0.
2 EMP_Engagement_5 2.89 0.0878 32.9 1.23e-228
3 (Intercept) 17.9 0.234 76.4 0.
4 EMP_Engagement_5 2.89 0.0877 32.9 7.62e-230
5 (Intercept) 17.9 0.234 76.2 0.
6 EMP_Engagement_5 2.91 0.0877 33.1 3.57e-232
7 (Intercept) 18.0 0.234 77.0 0.
8 EMP_Engagement_5 2.85 0.0875 32.6 1.24e-224
9 (Intercept) 17.8 0.234 76.0 0.
10 EMP_Engagement_5 2.91 0.0876 33.3 5.52e-234
11 (Intercept) 17.9 0.234 76.2 0.
12 EMP_Engagement_5 2.91 0.0878 33.1 5.27e-232
13 (Intercept) 17.8 0.234 76.0 0.
14 EMP_Engagement_5 2.93 0.0876 33.5 5.97e-237
15 (Intercept) 17.7 0.234 75.8 0.
16 EMP_Engagement_5 2.95 0.0875 33.7 1.02e-239
17 (Intercept) 17.9 0.234 76.5 0.
18 EMP_Engagement_5 2.91 0.0875 33.3 3.63e-234
19 (Intercept) 17.8 0.233 76.2 0.
20 EMP_Engagement_5 2.94 0.0874 33.7 5.26e-240
All of our coefficients are hovering around or slightly above our original coefficient, but our standard errors are uniformly smaller.
And we can then pool those results together:
pooledFit = pool(fit1)
summary(pooledFit)
                  estimate  std.error statistic       df p.value
(Intercept) 17.847540 0.25054709 71.23427 531.0723 0
EMP_Engagement_5 2.908382 0.09281962 31.33370 718.7926 0
How different are these pooled coefficients from our previous coefficients? They are certainly a bit stronger, but not terribly so.
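Behind the scenes, pool() applies Rubin’s rules: the pooled estimate is just the mean of the per-imputation estimates, and the total variance combines the within-imputation variance with the between-imputation variance. Here is a base-R sketch of those rules; the est and se values are made-up per-imputation results, not the output above:

```r
# Rubin's rules for pooling m imputed analyses (illustrative numbers)
est = c(2.89, 2.89, 2.91, 2.85, 2.91)   # hypothetical per-imputation slopes
se  = c(0.088, 0.088, 0.088, 0.088, 0.088)  # hypothetical per-imputation SEs
m = length(est)

pooledEst = mean(est)             # pooled point estimate
W = mean(se^2)                    # within-imputation variance
B = var(est)                      # between-imputation variance
totalVar = W + (1 + 1/m) * B      # Rubin's total variance
pooledSE = sqrt(totalVar)

c(estimate = pooledEst, std.error = pooledSE)
```

Note that the pooled standard error is always at least as large as the average within-imputation standard error – the between-imputation spread is the price we pay for our uncertainty about the missing values.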
On occasion you will see some data that looks like it has a lot of missingness. It might not truly be missing; it might just be sparse. Missing means that something was skipped (think of a survey question) and you have no idea what the value actually is. In sparse data, the observations have a true, recorded value of zero.
Sparsity comes up in a great many places, but the following example will hopefully help to make the point clearer. I shop on Amazon, but I have not bought everything there is to buy on Amazon. Not only have I not bought everything, I have not even seen every product on Amazon. If there were to be a matrix of every item that I have bought off of Amazon (all of those products get a 1) and everything I have not bought off Amazon (all of those products would get a 0), we would have very sparse data. In other words, Amazon actually knows what I have and have not bought.
On the other hand, let’s say that I just got a middle of the road Red Dragon keyboard RGB mechanical keyboard. Amazon knows that I bought it, so I would get a 1 in that column. However, I have not rated said keyboard; therefore, Amazon has no idea how much I actually like the keyboard. Essentially, there is a blank in the ratings column that cannot be assumed to be 0.
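In code terms, the distinction is between a recorded zero and an NA. A tiny sketch (the products and values here are invented):

```r
# Purchases are fully observed: 0 means "did not buy", not "unknown"
purchased = c(keyboard = 1, mouse = 0, monitor = 0, desk = 0)

# Ratings are genuinely missing: NA means "we have no idea"
rating = c(keyboard = NA, mouse = 5, monitor = NA, desk = NA)

sum(purchased == 0)         # sparsity: many known zeroes
sum(is.na(rating))          # missingness: values we cannot assume are zero
mean(rating, na.rm = TRUE)  # analyses must handle the NAs explicitly
```

Treating the NAs in rating as zeroes would silently turn “no opinion” into “hated it” – exactly the mistake the sparse/missing distinction guards against.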
There is no shortage of ways to cluster data and the reasons to do it are just as numerous. Clustering is great for reducing the number of dimensions within your data (i.e., combining several variables into one), but it is equally well suited to finding natural groups within data.
There are several large classes of cluster analyses, such as centroid-based or hierarchical clustering. Two of the bigger divisions are hard clustering and fuzzy clustering. Hard clustering algorithms assign each observation to exactly one cluster, whereas fuzzy clustering algorithms are probabilistic in nature (i.e., an observation has a certain probability of belonging to any one cluster).
Centroid clustering is based upon distances around some centroid (maybe a mean). Many centroid models tend to produce hard clustering by nature.
Careful attention needs to be paid to your data when clustering. Do your variables lend themselves to computing a meaningful mean? If so, k-means clustering might work well.
Determining the number of clusters is very much akin to determining the number of factors in factor analysis – there can be theory guiding the practice, but various statistics can help us make our decisions too. When clustering, especially hard clustering, we need to be thoughtful about the clarity of our clusters – too few clusters might not be helpful, but too many clusters might not separate very well. There are several ways to test for an appropriate number of clusters (scree plots, silhouette widths, the gap statistic), and we will lean on a battery of such indices to select an appropriate k.
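If you want to see the gap statistic on its own, cluster::clusGap computes it by comparing our within-cluster dispersion against a reference (no-cluster) distribution. A sketch on simulated data with two obvious groups (the data and parameter choices here are invented for illustration):

```r
library(cluster)  # ships with R

set.seed(1001)
# two well-separated blobs in two dimensions
simData = rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                matrix(rnorm(100, mean = 10), ncol = 2))

# B bootstrap reference sets; extra arguments (nstart) pass through to kmeans
gapStat = clusGap(simData, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 20)
gapStat$Tab  # logW, expected logW, gap, and its simulation SE for each k

# pick k with the "first SE max" rule
bestK = maxSE(gapStat$Tab[, "gap"], gapStat$Tab[, "SE.sim"])
bestK
```

With groups this well separated, the gap statistic points to a small k, matching what our eyes would tell us from a scatterplot.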
As an example, let’s use the data from the Holzinger Swineford article, found in the lavaan package.
library(factoextra)
library(lavaan)
library(dplyr)
testData = HolzingerSwineford1939 %>%
dplyr::select(starts_with("x"))
kTest = NbClust::NbClust(testData, method = "kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 4 proposed 3 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 5 proposed 8 as the best number of clusters
* 1 proposed 13 as the best number of clusters
* 1 proposed 14 as the best number of clusters
* 1 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
We are going to get a lot of information from the NbClust function, but a majority rule for number of clusters is the most helpful.
With our proposed number of clusters, let’s turn to some models.
Let’s run a k-mean model with our test data.
kmeansTest = kmeans(testData, 2)
kmeansTest
K-means clustering with 2 clusters of sizes 129, 172
Cluster means:
x1 x2 x3 x4 x5 x6 x7 x8
1 5.683463 6.565891 2.813953 3.894057 5.259690 2.973422 4.320863 5.743023
2 4.375000 5.729651 1.827762 2.436047 3.651163 1.594684 4.084681 5.365116
x9
1 5.785745
2 5.065407
Clustering vector:
[1] 2 2 2 1 2 2 2 2 1 2 2 1 1 2 1 2 2 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2
[36] 2 2 2 1 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 1 2 2 2
[71] 2 2 2 1 2 2 1 2 1 2 2 1 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1 1 2 1
[106] 1 1 1 1 2 1 1 1 1 2 2 1 2 2 2 2 2 1 2 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1
[141] 2 2 1 1 2 2 1 2 1 2 1 1 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 1 2 1 1 2
[176] 1 2 1 1 1 2 1 1 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 2 2 1 1 2 2 2 1 2 2 2 2
[211] 1 1 1 1 2 2 1 2 2 2 2 2 1 2 1 1 2 2 2 1 2 2 1 2 2 2 1 1 1 1 1 1 1 1 2
[246] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 2 1 1 2 2 2 2 2 1 2 1
[281] 2 1 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 2 1 2 1
Within cluster sum of squares by cluster:
[1] 1169.721 1484.421
(between_SS / total_SS = 22.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
Our output provides some very useful information. First, we see the cluster means for each item across the two clusters. These give us a pretty good idea of the “location” of each cluster and help us start to get an idea of what these clusters look like (we essentially have one higher-scoring group and one lower-scoring group).
Next, we see our clustering vector. This tells us to which cluster each observation was assigned.
We can also take a look at the bivariate plots, which will show us the cluster membership within each pair of variables.
plot(testData, col = kmeansTest$cluster)
We can see the variables where our clusters separate pretty nicely (e.g., x1 and x4) and where they do not (e.g., x8 and x9).
And a plot based upon the first two components of a PCA:
cluster::clusplot(testData, kmeansTest$cluster)
k-means clustering is great and widely used. But it can be picky about data (it really likes data without extreme values) and it can produce different results based upon item ordering (because of the cluster assignment process).
To circumvent these issues with k-means, we can partition around medoids. Whereas a centroid can be an arbitrary point that observations get clustered around, a medoid is an actual observation within the data that other observations get clustered around. In other words, a centroid need not correspond to any specific observation in the data, but a medoid always does.
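We can verify the “medoids are real observations” point directly: every medoid returned by pam is literally a row of the input data. A quick sketch on simulated data (the data here are invented):

```r
library(cluster)  # ships with R

set.seed(1001)
# two simulated groups in two dimensions
simData = rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                matrix(rnorm(60, mean = 5), ncol = 2))

pamFit = pam(simData, k = 2)
pamFit$id.med    # row indices of the two medoids
pamFit$medoids   # the medoids themselves

# every medoid matches its row of the original data exactly
all(simData[pamFit$id.med, ] == pamFit$medoids)
```

A k-means center on the same data would almost never coincide exactly with an observed row, which is the whole distinction between a centroid and a medoid.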
library(cluster)
pamTest = cluster::pam(testData, k = 2)
pamTest
Medoids:
ID x1 x2 x3 x4 x5 x6 x7 x8 x9
[1,] 162 5.000000 6.25 2.5 3.333333 5.75 2.571429 4.086957 5.65 5.583333
[2,] 11 3.666667 5.75 2.0 2.000000 3.50 1.571429 3.739130 5.70 4.305556
Clustering vector:
[1] 1 2 2 1 2 2 1 2 1 2 2 1 1 1 1 2 2 2 1 1 2 1 1 2 2 2 2 2 1 2 2 2 1 1 2
[36] 2 1 2 1 1 2 2 2 2 1 1 2 2 2 1 2 1 2 2 1 1 2 1 2 2 1 1 2 1 2 1 1 2 2 1
[71] 2 2 2 1 2 2 1 2 1 2 2 1 2 2 2 1 1 1 2 1 2 2 2 2 1 1 1 1 2 1 1 1 1 1 1
[106] 1 1 1 1 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1
[141] 1 2 1 1 2 1 1 2 1 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1 2 2 1 2 1 2 1 1 1
[176] 1 2 1 1 1 1 1 1 2 2 1 2 1 2 2 2 1 1 2 2 1 1 1 2 2 1 1 2 1 1 1 2 2 2 2
[211] 1 1 1 1 2 1 1 2 1 2 2 2 1 2 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1
[246] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1
[281] 2 1 2 1 2 1 1 2 1 1 1 2 1 1 1 1 2 2 1 1 1
Objective function:
build swap
3.159638 2.999334
Available components:
[1] "medoids" "id.med" "clustering" "objective" "isolation"
[6] "clusinfo" "silinfo" "diss" "call" "data"
We can see very similar output to what we got from k-means, but we get the actual observations that represent the medoids. This identification helps us get a better idea about what the clusters look like.
cluster::clusplot(testData, pamTest$clustering)
If we look at both plots, we can notice some definite differences in cluster assignment as we get toward the 0s on the component axes (pam puts some observations in one cluster, while k-means puts the same observations into the other). This is purely a function of how the two methods compute the cluster center.
par(mfrow = c(1, 2))
cluster::clusplot(testData, pamTest$clustering, main = "PAM")
cluster::clusplot(testData, kmeansTest$cluster, main = "KMeans")
Any centroid-based clustering algorithm works great when our data look pretty normal. Data that are not normally distributed can cause issues in finding a decent centroid.
Imagine a variable distributed in the following manner:
It looks like it has something close to a normal distribution on the right and something a bit lumpy on the left. Using a centroid-based algorithm would find a centroid, but we could almost imagine that this distribution is a mix of two different distributions – any single centroid is going to betray the actual distribution of this data. Situations like this call for distribution clustering.
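We can simulate such a mixture and see the problem directly: the overall mean lands in the low-density valley between the two components (the data here are simulated, not the plotted variable):

```r
set.seed(1001)
# mixture of two normal components
mixed = c(rnorm(300, mean = 0, sd = 1), rnorm(300, mean = 5, sd = 1))

center = mean(mixed)  # lands between the two humps
center

# density is much lower at the overall mean than at either component mean
densFun = approxfun(density(mixed))
c(at_center = densFun(center), at_0 = densFun(0), at_5 = densFun(5))
```

The single “center” sits where almost no observations actually live – a one-centroid summary misrepresents both components at once.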
Sometimes referred to as model-based clustering, distribution clustering allows us to model variables that have mixed distributions (maybe a mixture of normal distributions or a normal distribution and something like the bimodal-looking distribution we just saw).
If your data have a mixture of normals or you just want the actual distribution of your data to drive the clustering, then these distribution clustering algorithms will offer better solutions. Furthermore, they are better suited for clustering items that might have a latent structure.
Given that we are using our data’s distributions, we need to do another check on the number of clusters. Not only will we get a good number of clusters, but we will also get the clustering shapes that best fit our data.
library(mclust)
bicTest = mclustBIC(testData)
bicTest
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE
1 -8395.208 -8395.208 -8411.764 -8411.764 -8411.764 -8411.764 -7698.368
2 -8091.098 -8096.223 -8098.389 -8103.418 -8087.308 -8092.388 -7684.869
3 -8043.834 -8026.535 -7946.965 -7948.718 -7978.589 -7981.702 -7720.952
4 -7997.530 -7996.730 -7925.971 -7930.479 -7995.874 -8017.191 -7757.011
5 -7974.597 -7980.670 -7896.040 -7898.856 -7998.222 -8002.864 -7766.243
6 -7967.265 -7976.579 -7914.053 -7913.319 -8024.885 -8023.845 -7814.074
7 -7966.191 -7972.153 -7910.080 -7928.153 -8071.526 -8078.025 -7816.869
8 -7974.345 -8002.181 -7926.699 -7948.928 -8109.304 -8123.281 -7858.328
9 -8002.500 -8035.987 -7960.374 -7972.648 -8150.685 -8173.271 -7889.681
EVE VEE VVE EEV VEV EVV VVV
1 -7698.368 -7698.368 -7698.368 -7698.368 -7698.368 -7698.368 -7698.368
2 -7705.746 -7683.987 -7703.536 -7808.094 -7816.565 -7843.921 -7833.820
3 -7744.304 -7715.837 -7759.226 -7956.501 -7973.278 -8059.269 -8045.041
4 -7812.783 -7746.291 -7808.855 -8125.923 -8115.375 -8226.645 -8225.402
5 -7864.346 -7781.799 -7879.806 -8277.594 -8292.735 -8402.984 -8457.127
6 -7952.675 -7818.828 -7945.551 -8411.152 -8456.776 -8613.663 -8680.916
7 -7980.994 -7829.336 -7936.425 -8586.715 -8612.927 -8804.420 -8834.708
8 -8023.371 -7866.886 -8058.225 -8789.471 -8814.082 -9049.331 -9086.450
9 -8094.573 -7906.292 -8111.024 -8969.355 -9024.142 -9281.446 -9344.505
Top 3 models based on the BIC criterion:
VEE,2 EEE,2 EEE,1
-7683.987 -7684.869 -7698.368
In looking at the bottom of the output (Top 3 models based on the BIC criterion), we see that a two-cluster solution might work well, given that those are the first two solutions offered. The best solution, however, is “VEE, 2” – what does VEE mean? It means that our mixture components are ellipsoidal, with variable volume but equal shape and orientation. Check the mclustModelNames function for all of the names.
Let’s take a peek at it:
mclustTest = Mclust(testData, 2, modelNames = "VEE")
summary(mclustTest)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEE (ellipsoidal, equal shape and orientation) model with 2
components:
log.likelihood n df BIC ICL
-3656.513 301 65 -7683.987 -7715.366
Clustering table:
1 2
205 96
We can see that the provided output is very different from our previous clustering efforts. We have a clustering table, demonstrating how many observations are grouped into the two clusters. We also have mixing probabilities.
mclustTest$parameters$pro
[1] 0.6638782 0.3361218
The mixing probabilities add up to 1 and are just the probabilities that an observation will wind up in one cluster or another.
Now, let’s look at our clustering centers:
mclustTest$parameters$mean
       [,1]     [,2]
x1 4.742245 5.318003
x2 5.768518 6.719131
x3 1.599823 3.535407
x4 2.941652 3.296451
x5 4.294729 4.430997
x6 2.066251 2.421243
x7 4.155177 4.246587
x8 5.435710 5.707536
x9 5.289581 5.541104
This is giving us an idea about the center for each of the clusters.
plot(mclustTest, what = "classification")
We can also look at the components graph (like we saw earlier):
coordProj(testData, parameters = mclustTest$parameters, z = mclustTest$z)
Everything we have seen to this point needs some type of work to know how many clusters to find.
Conversely, hierarchical clustering will cluster without any such input. The hierarchical approach can start with every observation as its own cluster and then connect those individual clusters into larger clusters; this continues until every cluster sits under one larger cluster. This process is known as agglomerative clustering (the reverse, where everything starts in one big cluster that is broken down into smaller clusters, is known as divisive clustering).
hieararchicalTest = hclust(dist(testData))

plot(hieararchicalTest)
This monstrous-looking dendrogram shows how the observations cluster all the way up the tree. The y-axis, Height, tells us how similar observations are when they get clustered together – i.e., observations that join at height 1 are much more similar than observations joining at height 6.
If we want to get an idea about cluster membership, we need to determine where we want to cut our tree. For consistency with our other clustering algorithms, we can just go with defining two clusters:
hier2Clus = cutree(hieararchicalTest, k = 2)
This provides the cluster membership so that we can identify the observations in our data that belong to each cluster:
head(testData[hier2Clus == 2, ])
         x1   x2    x3       x4   x5       x6       x7   x8       x9
15 5.833333 5.75 3.625 5.000000 5.50 3.000000 4.000000 4.35 5.861111
19 5.666667 5.25 4.000 4.333333 5.25 3.714286 3.913044 4.85 5.750000
20 6.333333 8.75 3.000 3.666667 3.75 2.571429 3.478261 5.35 4.916667
67 4.000000 9.25 4.000 3.000000 3.75 1.000000 3.608696 5.75 6.194444
74 5.000000 6.00 3.250 3.333333 5.75 2.714286 3.478261 4.60 4.277778
82 4.666667 8.00 4.250 3.000000 4.75 2.428571 5.130435 5.75 5.305556
Finally, we could get the averages for each variable for both clusters:
rbind(cluster1 = testData[hier2Clus == 1, ] %>%
summarize_all(mean),
cluster2 = testData[hier2Clus == 2, ] %>%
      summarize_all(mean))
               x1       x2       x3       x4       x5       x6       x7
cluster1 4.722452 5.844008 2.042872 2.749311 4.047521 1.864227 4.120014
cluster2 5.810734 7.088983 3.101695 4.338983 5.542373 3.503632 4.456153
x8 x9
cluster1 5.421694 5.234848
cluster2 5.959322 5.945386
Just like everything else we have seen, we have a lower scoring group and a higher scoring group.
To this point, we have been using “hard” clustering – an observation is put into one cluster and there is no notion that the observation could belong to another. Fuzzy clustering changes this by providing probabilities of class membership.
fuzzyTest = fanny(testData, k = 2, memb.exp = 1.25)
We can get our clusters, just like we did with everything else:
head(fuzzyTest$clustering)
[1] 1 1 1 2 1 1
But a key difference is that we can also get the membership coefficients:
head(fuzzyTest$membership)
          [,1]       [,2]
[1,] 0.5830297 0.41697030
[2,] 0.7306898 0.26931022
[3,] 0.9219575 0.07804253
[4,] 0.3316131 0.66838689
[5,] 0.7620418 0.23795817
[6,] 0.7734978 0.22650221
These can essentially be thought of as the probability of belonging to a cluster.
fviz_cluster(fuzzyTest)
We can see that a great many of the observations are clearly assigned, but we can look at one of the observations resting between the 2 clusters.
fuzzyTest$membership[67, ]
[1] 0.4565806 0.5434194
We can see that it is not exactly 50/50, but it is pretty clear that this observation’s membership in cluster 2 is not 100% concrete.
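A key property that lets us read the membership coefficients as probabilities is that each observation’s row sums to 1. A quick check on simulated data (the data here are invented):

```r
library(cluster)  # fanny lives here; cluster ships with R

set.seed(1001)
# two simulated groups in two dimensions
simData = rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                matrix(rnorm(60, mean = 4), ncol = 2))

fuzzyFit = fanny(simData, k = 2, memb.exp = 1.25)
head(fuzzyFit$membership)            # per-cluster membership coefficients
range(rowSums(fuzzyFit$membership))  # every row sums to 1
```

Observations sitting squarely inside a group get memberships near 1 and 0; observations near the boundary split their membership more evenly, just like observation 67 above.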
We have covered a big slice of models that deal with some type of latent structure. Latent class analysis (LCA) differs from what we have previously seen. In factor analysis, for example, we are finding the latent variables that give cause to the measured items. LCA, on the other hand, attempts to use variables to find latent classes of people. For example, if we have variables regarding race, gender, education, and employment status, can we find naturally occurring groups that cluster together?
LCA has definite conceptual relationships to cluster analysis and factor analysis. From a technical standpoint, however, they are different. While we saw that certain variable types work better for different types of cluster analyses, categorical variables are the only variables that can be used in LCA (e.g., it could not use raw age – you would unfortunately need to discretize it first). The latent classes that emerge in LCA are often noted as “profiles” (some people even call it latent profile analysis) – thinking of them as profiles allows you to think about constructing general profiles and grouping people into those profiles.
Let’s try to fit a model with 2 latent classes based upon education, gender, and party identification. We can use the election data from poLCA.
library(poLCA)
library(dplyr)
data("election")
lcaDat = election
lcaFormula = cbind(EDUC, GENDER, PARTY) ~ 1
lcaMod2 = poLCA(lcaFormula, lcaDat, 2, maxiter = 5000)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$EDUC
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
class 1: 0.0445 0.0838 0.3899 0.2047 0.0995 0.1302 0.0475
class 2: 0.0182 0.0272 0.1472 0.2196 0.0863 0.3223 0.1793
$GENDER
Pr(1) Pr(2)
class 1: 0.3374 0.6626
class 2: 0.5809 0.4191
$PARTY
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
class 1: 0.2304 0.2139 0.1624 0.1673 0.0888 0.0768 0.0604
class 2: 0.1472 0.0715 0.1362 0.0357 0.1895 0.1832 0.2367
Estimated class population shares
0.5863 0.4137
Predicted class memberships (by modal posterior prob.)
0.608 0.392
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 1755
number of estimated parameters: 27
residual degrees of freedom: 70
maximum log-likelihood: -7628.878
AIC(2): 15311.76
BIC(2): 15459.45
G^2(2): 89.21528 (Likelihood ratio/deviance statistic)
X^2(2): 86.54713 (Chi-square goodness of fit)
Our output is showing us our two classes, as we specified, and the probabilities of each variable’s response category belonging to a class. This will be explained in more depth shortly. The output also gives us our estimated population shares (how many are observed) and the predicted class memberships (how many would be predicted).
We also get some fit statistics and those are interpreted as we would normally expect.
Seeing a plot of the class probabilities will be helpful as an explanatory tool and the help for the election data is critical for knowing what the levels are within the data:
EDUC is 1 (8th grade or less) through 7 (advanced degree), gender is 1 (male), and party is 1 (strong Democrat) to 7 (strong Republican).
plot(lcaMod2)
If we had to put a face to class 2, we might imagine mostly men, with some level of college education, who trend Republican. Class 1 is largely women, with a slight trend toward Democrat and slightly less education.
One interesting thing about LCA is that an automated class selection mechanism really does not exist – you need to engage in some model comparison.
Let’s try models with 3, 4, and 5 classes.
lcaMod3 = poLCA(lcaFormula, lcaDat, 3, maxiter = 5000)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$EDUC
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
class 1: 0.0000 0.0288 0.2160 0.2349 0.1296 0.2757 0.1149
class 2: 0.0805 0.0700 0.1452 0.1752 0.0001 0.3002 0.2288
class 3: 0.0513 0.0860 0.3922 0.1987 0.0868 0.1275 0.0576
$GENDER
Pr(1) Pr(2)
class 1: 0.4086 0.5914
class 2: 1.0000 0.0000
class 3: 0.3186 0.6814
$PARTY
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
class 1: 0.1005 0.1315 0.1240 0.0199 0.1699 0.1748 0.2794
class 2: 0.2087 0.0000 0.1672 0.0734 0.2157 0.1884 0.1466
class 3: 0.2776 0.2161 0.1721 0.2057 0.0733 0.0553 0.0000
Estimated class population shares
0.4138 0.1208 0.4654
Predicted class memberships (by modal posterior prob.)
0.384 0.0872 0.5288
=========================================================
Fit for 3 latent classes:
=========================================================
number of observations: 1755
number of estimated parameters: 41
residual degrees of freedom: 56
maximum log-likelihood: -7615.985
AIC(3): 15313.97
BIC(3): 15538.25
G^2(3): 63.42943 (Likelihood ratio/deviance statistic)
X^2(3): 63.31322 (Chi-square goodness of fit)
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
plot(lcaMod3)

lcaMod4 = poLCA(lcaFormula, lcaDat, 4, maxiter = 5000)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$EDUC
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
class 1: 0.0000 0.0000 0.0036 0.2239 0.1350 0.4074 0.2301
class 2: 0.0000 0.0652 0.5339 0.2625 0.0438 0.0946 0.0000
class 3: 0.1639 0.1330 0.1808 0.0547 0.2288 0.1251 0.1137
class 4: 0.1957 0.2135 0.4163 0.0789 0.0000 0.0422 0.0534
$GENDER
Pr(1) Pr(2)
class 1: 0.5766 0.4234
class 2: 0.3246 0.6754
class 3: 0.0000 1.0000
class 4: 1.0000 0.0000
$PARTY
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
class 1: 0.1560 0.1038 0.1374 0.0563 0.1720 0.1609 0.2136
class 2: 0.1882 0.1657 0.1757 0.1603 0.1266 0.0716 0.1119
class 3: 0.3032 0.3709 0.0901 0.0987 0.0296 0.1076 0.0000
class 4: 0.2789 0.0450 0.1673 0.1325 0.0959 0.2207 0.0596
Estimated class population shares
0.3711 0.4396 0.1078 0.0815
Predicted class memberships (by modal posterior prob.)
0.4376 0.433 0.0764 0.053
=========================================================
Fit for 4 latent classes:
=========================================================
number of observations: 1755
number of estimated parameters: 55
residual degrees of freedom: 42
maximum log-likelihood: -7601.741
AIC(4): 15313.48
BIC(4): 15614.34
G^2(4): 34.94146 (Likelihood ratio/deviance statistic)
X^2(4): 34.35936 (Chi-square goodness of fit)
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
plot(lcaMod4)

lcaMod5 = poLCA(lcaFormula, lcaDat, 5, maxiter = 5000)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$EDUC
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
class 1: 0.1585 0.1462 0.3503 0.0324 0.1510 0.0130 0.1485
class 2: 0.1920 0.2198 0.4563 0.1135 0.0008 0.0176 0.0000
class 3: 0.0000 0.0469 0.4691 0.2706 0.0698 0.1436 0.0000
class 4: 0.0000 0.0000 0.0001 0.1768 0.0898 0.3918 0.3416
class 5: 0.0000 0.0435 0.2192 0.2813 0.1456 0.2929 0.0175
$GENDER
Pr(1) Pr(2)
class 1: 0.0000 1.0000
class 2: 1.0000 0.0000
class 3: 0.3233 0.6767
class 4: 0.6588 0.3412
class 5: 0.3909 0.6091
$PARTY
Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
class 1: 0.3286 0.3430 0.1008 0.0950 0.0349 0.0977 0.0000
class 2: 0.2807 0.0536 0.1729 0.1378 0.0912 0.2122 0.0517
class 3: 0.1960 0.0858 0.2394 0.2080 0.1740 0.0000 0.0967
class 4: 0.1771 0.0495 0.1673 0.0594 0.2128 0.1277 0.2062
class 5: 0.1147 0.3298 0.0060 0.0142 0.0309 0.2903 0.2140
Estimated class population shares
0.1114 0.0831 0.3536 0.2393 0.2126
Predicted class memberships (by modal posterior prob.)
0.0957 0.0764 0.3812 0.2348 0.212
=========================================================
Fit for 5 latent classes:
=========================================================
number of observations: 1755
number of estimated parameters: 69
residual degrees of freedom: 28
maximum log-likelihood: -7595.543
AIC(5): 15329.09
BIC(5): 15706.53
G^2(5): 22.54605 (Likelihood ratio/deviance statistic)
X^2(5): 22.1376 (Chi-square goodness of fit)
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
plot(lcaMod5)
Using each model’s AIC, let’s see which model might work the best:
rbind(class2 = lcaMod2$aic,
class3 = lcaMod3$aic,
class4 = lcaMod4$aic,
      class5 = lcaMod5$aic)
           [,1]
class2 15311.76
class3 15313.97
class4 15313.48
class5 15329.09
There is not too much difference between 2 and 3 classes, so we could stick with parsimony and be happy with 2 classes.
Latent class analysis will also let us include a covariate in our model, essentially performing a regression. Let’s define a latent class based upon views of presidential candidates (marked by a B or G after the variable name):
covariateFormula = cbind(LEADG, MORALG, LEADB, MORALB) ~ PARTY
intelPartyLCA = poLCA(covariateFormula, lcaDat, 2, maxiter = 5000)
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$LEADG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.2524 0.5765 0.1476 0.0234
class 2: 0.0118 0.2264 0.5240 0.2379
$MORALG
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.3821 0.5261 0.0736 0.0182
class 2: 0.0935 0.4234 0.3116 0.1716
$LEADB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.0621 0.4034 0.3756 0.1590
class 2: 0.3013 0.6316 0.0575 0.0096
$MORALB
1 Extremely well 2 Quite well 3 Not too well 4 Not well at all
class 1: 0.1016 0.4847 0.3178 0.0960
class 2: 0.3621 0.5576 0.0687 0.0117
Estimated class population shares
0.569 0.431
Predicted class memberships (by modal posterior prob.)
0.5696 0.4304
=========================================================
Fit for 2 latent classes:
=========================================================
2 / 1
Coefficient Std. error t value Pr(>|t|)
(Intercept) -5.65737 0.45270 -12.497 0
PARTY 1.32879 0.10246 12.969 0
=========================================================
number of observations: 1459
number of estimated parameters: 26
residual degrees of freedom: 229
maximum log-likelihood: -6438.342
AIC(2): 12928.68
BIC(2): 13066.11
X^2(2): 3273.342 (Chi-square goodness of fit)
We have the same output with regard to our class membership that we had before, but we also get some information about our regression model fit. In our model, we specified two latent classes; examining our latent classes would lead us to assume that those in class 1 favor candidate G, while those in class 2 favor candidate B. In our regression output, we are comparing class 2 to class 1. We see our typical regression results and we know that the effect is significant, but what exactly does it mean?
To start, we are comparing the likelihood that a person will belong to class 2 as opposed to class 1 (that is what the 2 / 1 indicates). If we plug our PARTY scores into the estimated equation, we can convert the log-odds into prior class-membership probabilities:
classPriors = exp(cbind(1, c(1:7)) %*% intelPartyLCA$coeff)
priorProbs = cbind(1, classPriors) / (1 + rowSums(classPriors))
priorProbsData = data.frame(class1G = priorProbs[, 1],
                            class2B = priorProbs[, 2],
partyValue = 1:7) %>%
tidyr::gather(.,key = class, value = probability, -partyValue)
library(ggplot2)
ggplot(priorProbsData, aes(x = partyValue, y = probability, color = class)) +
geom_line() +
theme_minimal()
Recall that PARTY is coded 1 (strong Democrat) to 7 (strong Republican). Someone with a PARTY value of 1 would have a probability near 1 of belonging to class 1 and a probability near 0 of belonging to class 2. Conversely, someone with a PARTY value of 7 would have a probability near 0 for class 1 and a probability near 1 for class 2.
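We can trace those probabilities directly from the printed coefficients (intercept -5.65737, PARTY slope 1.32879) with the inverse logit, since the coefficients are on the log-odds scale for class 2 versus class 1:

```r
# log-odds of class 2 (vs class 1), from the regression output above
coefIntercept = -5.65737
coefParty = 1.32879

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p2_strongDem = plogis(coefIntercept + coefParty * 1)  # PARTY = 1
p2_strongRep = plogis(coefIntercept + coefParty * 7)  # PARTY = 7

round(c(strongDem = p2_strongDem, strongRep = p2_strongRep), 3)
```

A strong Democrat has a tiny prior probability of landing in class 2, while a strong Republican is nearly certain to, matching the endpoints of the plotted curves.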
Using this added feature of latent class analysis can help to make more sense of the classes and give you a better idea about how people within those classes might behave.
Most of the statistical shenanigans you have seen to this point have come from the parametric family. In other words, we have been making assumptions about the underlying distributions. What if we don’t want to make any assumptions, or we really have no idea and want the form to be defined by our actual data? Then we are operating in a non-parametric space. How do we let our data do the talking?
There are many types of smooths. You will frequently see loess (local regression – sometimes you will hear it called locally-weighted scatterplot smoothing, or lowess). With a loess line, we are fitting some polynomial (generally linear or quadratic) to a small section of our data at a time (i.e., a local group) – this is a little more complicated than a moving-average window type of smooth. Each small section has an associated line, and each line gets joined with the line in the next group (the join points are referred to as knots). Since we are largely in control here, we get to specify how wiggly things might get.
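To see the span/wiggle trade-off in action, here is a minimal sketch using base R’s lowess() – the x and y values are simulated purely for illustration:

```r
# Simulated data purely to illustrate a local-regression smooth
set.seed(1001)
x = sort(runif(200, 0, 10))
y = sin(x) + rnorm(200, sd = .3)

# f is the span: the fraction of the data used for each local fit.
# A smaller f gives a wigglier line; a larger f gives a smoother one.
plot(x, y)
lines(lowess(x, y, f = .2), col = "red")   # wiggly
lines(lowess(x, y, f = .8), col = "blue")  # smooth
```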
You will also see regression splines (largely what we will be using here today). The great thing about these is that we can penalize them!
Very briefly, an additive model is not much different from our normal interpretation of a model. In an additive model, we can look at the effect of a predictor on a dependent variable without any consideration of what other variables might be in the model. We can add these effects together to best predict our response.
GMAT TIME – lm is to glm, as additive models are to…?
During the last few weeks, we have largely been working in the generalized linear models framework. We are going to stay in the general vicinity, but start moving to some more interesting places! We have mostly seen straight lines being fitted to various things. As many of you have likely noted, it often seems like a straight line doesn’t really fit the relationships that we can see within our data. So, what do you do? We could always go the transformation route, but that seems a bit antiquated at this point…don’t you think?
What if we fit a smooth line to our data instead of trying to jam a single straight line somewhere it does not want to be or do something like throwing a single quadratic term into the model? Now we are doing something interesting.
The car package is old-school R, but still has some handy stuff for us.
library(car)
library(dplyr)
noLeaders = read.csv("leadershipRatingsAgreement.csv")
scatterplotMatrix(noLeaders[, !(names(noLeaders) %in% c("leaderID"))])
The splom (scatterplot matrix) that we just saw gives us a really good idea about the relationships within the data. The green line is the linear fit and the red line is a smoothed fit. If those are not sitting on top of each other, you might want to think carefully about the relationship that is present.
plotDat = noLeaders %>%
dplyr::select(effect, enabling) %>%
na.omit
plot(effect ~ enabling, data = plotDat)
lines(sort(plotDat$enabling),
fitted(lm(effect ~ enabling,
data = plotDat))[order(plotDat$enabling)], col = "red")
lines(sort(plotDat$enabling),
fitted(lm(effect ~ I(enabling^2),
data = plotDat))[order(plotDat$enabling)], col = "blue")
lines(sort(plotDat$enabling),
fitted(lm(effect ~ I(enabling^3),
data = plotDat))[order(plotDat$enabling)], col = "green")
The preceding figure shows us 3 different lines: a linear regression line and two higher-order trends. We will use them as a reference.
Let’s check this out:
lmTest = lm(effect ~ enabling, data = noLeaders)
summary(lmTest)
Call:
lm(formula = effect ~ enabling, data = noLeaders)
Residuals:
Min 1Q Median 3Q Max
-1.6508 -0.3554 -0.1035 0.2485 3.6317
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.71439 0.01218 58.66 <2e-16 ***
enabling 0.91594 0.03018 30.35 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5379 on 4619 degrees of freedom
(265 observations deleted due to missingness)
Multiple R-squared: 0.1663, Adjusted R-squared: 0.1661
F-statistic: 921.2 on 1 and 4619 DF, p-value: < 2.2e-16
Nothing too new here, so let’s move along!
library(mgcv)
gamTest = gam(effect ~ enabling, data = noLeaders)
summary(gamTest)
Family: gaussian
Link function: identity
Formula:
effect ~ enabling
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.71439 0.01218 58.66 <2e-16 ***
enabling 0.91594 0.03018 30.35 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.166 Deviance explained = 16.6%
GCV = 0.28951 Scale est. = 0.28939 n = 4621
You should notice that there is no difference between our standard linear model and our gam with regard to the coefficient. If we do not smooth a variable, it gets treated just as it would in a linear regression model. We also get some additional output: adjusted R^2 (interpreted as per normal) and deviance explained, which gives us very similar information to adjusted R^2 (it just uses a different error calculation than the sums of squared error between fitted and observed values). The scale estimate, in this case, is the residual standard error squared. GCV is the minimized generalised cross-validation score, and it gives us an idea about our prediction error (ideally, we want this to be a small value).
Let’s try a smooth. In the following code, you will notice how we wrapped our term in s(). Believe it or not, this is how we specify a smooth term. We could spend a whole week on different ways to smooth things, but we will just stick with s() and its defaults for now.
gamTestSmooth = gam(effect ~ s(enabling), data = noLeaders)
summary(gamTestSmooth)
Family: gaussian
Link function: identity
Formula:
effect ~ s(enabling)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.995363 0.007893 126.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(enabling) 4.926 6.073 156.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.17 Deviance explained = 17.1%
GCV = 0.28827 Scale est. = 0.2879 n = 4621
After smoothing our term, we can see that our output has changed. Instead of getting a linear regression coefficient, we get an edf (estimated degrees of freedom). While these edf values lack the clean interpretation of our linear regression coefficients, we can still get a great deal of information from them. The closer edf is to 1, the more linear in nature the term actually is; as edf increases beyond 1, the relationship becomes increasingly wiggly.
Since we included a smooth term, we can see that our model fit has improved from our previous gam without a smooth term.
If we plot our newly-fitted gam model back onto our previous visualization, here is what we get:
plot(effect ~ enabling, data = plotDat)
lines(sort(plotDat$enabling),
fitted(lm(effect ~ enabling,
data = plotDat))[order(plotDat$enabling)], col = "red")
lines(sort(plotDat$enabling),
fitted(lm(effect ~ I(enabling^2),
data = plotDat))[order(plotDat$enabling)], col = "blue")
lines(sort(plotDat$enabling),
fitted(lm(effect ~ I(enabling^3),
data = plotDat))[order(plotDat$enabling)], col = "green")
lines(sort(plotDat$enabling),
fitted(gam(effect ~ s(enabling),
data = noLeaders))[order(plotDat$enabling)],
col = "orange")
The soft orange line is our gam fit. We can see that it does not rocket upwards like our higher-order terms, but instead captures a bit of the downward trend towards the larger values of the enabling variable.
The wiggle can be controlled and you are the one to control it (all models are your monster, so build them in a way that you can control them). An important consideration with the wiggle (and with almost all of our decisions from here on out) is the bias/variance trade-off. You will see this called other things (e.g., error/variance), depending on with whom you are hanging around. Since we have only talked about bias briefly, we do not need to worry about getting bias in this sense conflated with anything else.
It works like this: you cannot have your cake and eat it too. Do you want your in-sample prediction to be awesome (low bias)? Great! You can count on getting that at the expense of higher variance. The lower the variance, the better your model will predict new data. Well, that sounds easy – just go with the lowest variance. But…that might mean missing some weird pattern. Again, it is just a decision to make (you likely won’t be facing off with your monsters in the Arctic in the end).
With our gam models, the wigglier your line, the lower your bias will be and the better you are doing at predicting in sample.
library(ggplot2)
gamTestLambda1 = gam(effect ~ s(enabling, sp = 0, k = 40), data = noLeaders)
p = predict(gamTestLambda1, type = "lpmatrix")
beta = coef(gamTestLambda1)
s = p %*% beta
plotDat = cbind.data.frame(s = s, enabling = na.omit(noLeaders$enabling))
gam1Plot = ggplot(plotDat, aes(enabling, s)) +
geom_line(color = "#ff5500") +
geom_point(data = noLeaders, aes(enabling, effect), alpha = .5) +
theme_minimal()
gamTestLambda9 = gam(effect ~ s(enabling, sp = .9, k = 40), data = noLeaders)
p = predict(gamTestLambda9, type = "lpmatrix")
beta = coef(gamTestLambda9)
s = p %*% beta
plotDat = cbind.data.frame(s = s, enabling = na.omit(noLeaders$enabling))
gam9Plot = ggplot(plotDat, aes(enabling, s)) +
geom_line(color = "#ff5500") +
geom_point(data = noLeaders, aes(enabling, effect), alpha = .5) +
theme_minimal()
library(gridExtra)
gridExtra::grid.arrange(gam1Plot, gam9Plot)
In the top plot, we have allowed our line a bit more flexibility to wiggle – you can see the line bending more to fit the pattern within your data. We are going to get very good in-sample prediction here, at the expense of out-of-sample prediction. The bottom plot is a bit more reserved. It will undoubtedly do better out-of-sample, but might be missing something within the in-sample data.
Decision trees are great for classification (i.e., which category does something belong to). For the type of decision tree that we are going to use, we can think about it almost like we would a logistic model. There are, however, some major differences. The first is that we can feed any number of predictors to the tree. At our root node, we will have all of the data and all of the predictors. The predictor variables are then scanned for the split that will result in the most “pure” branches – the tree selects a variable, and a split point of that variable, that gives a good separation between “0” and “1”. We then create another branch and repeat the process. We do this until we have all of the observations classified.
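To make the “pure branch” idea concrete, here is a minimal base-R sketch of the impurity scan, using the Gini index on a made-up 0/1 outcome and candidate predictor:

```r
# Gini impurity: 0 for a pure node, .5 for a 50/50 mix of 0s and 1s
gini = function(y) {
  p = mean(y)
  2 * p * (1 - p)
}

y = c(0, 0, 0, 1, 1, 1, 1, 1)   # made-up outcome at the current node
x = c(1, 2, 3, 4, 5, 6, 7, 8)   # a made-up candidate predictor

gini(y)   # impurity before splitting

# weighted impurity after splitting at x < 3.5
left  = y[x < 3.5]
right = y[x >= 3.5]
(length(left) * gini(left) + length(right) * gini(right)) / length(y)
# this split makes both children pure, so the weighted impurity drops to 0
```

The tree tries many candidate split points like this and greedily keeps the one with the biggest impurity drop.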
One beauty of the decision tree is in its ability to select important variables. One drawback, however, is that it does this in a greedy fashion. It finds what splits best and goes with it at that node; it does this without consideration of what variables might help to split better later on.
Another great thing about these trees is that they can be used for either classification or regression.
library(RCurl)
url = "https://raw.githubusercontent.com/gastonstat/CreditScoring/master/CleanCreditScoring.csv"
creditScores = getURL(url)
creditScores = read.csv(textConnection(creditScores))
library(caret)
library(rpart)
library(rpart.plot)
library(e1071)
set.seed(10001)
tree1 = rpart(Status ~ Income + Savings + Debt,
data = creditScores, method = "class")
rpart.plot::prp(tree1, box.palette = "Blues", extra = 8)
Notice the “extra” argument in our prp function is set to 8. This option gives us the probability of belonging to the predicted class (e.g., for everyone with a savings value over 2, there is a probability of .78 of having good credit). There are many options, so be sure to look at the help file and play around with them.
confusionMatrix(predict(tree1, type = "class"),
creditScores$Status)
Confusion Matrix and Statistics
Reference
Prediction bad good
bad 131 92
good 1118 3105
Accuracy : 0.7278
95% CI : (0.7145, 0.7409)
No Information Rate : 0.7191
P-Value [Acc > NIR] : 0.09917
Kappa : 0.1015
Mcnemar's Test P-Value : < 2e-16
Sensitivity : 0.10488
Specificity : 0.97122
Pos Pred Value : 0.58744
Neg Pred Value : 0.73526
Prevalence : 0.28093
Detection Rate : 0.02946
Detection Prevalence : 0.05016
Balanced Accuracy : 0.53805
'Positive' Class : bad
Let’s check one out with some more variables:
set.seed(10001)
noRecodes = creditScores %>%
dplyr::select(-ends_with("R"))
bigTree = rpart(Status ~ ., data = noRecodes,
method = "class")
rpart.plot::prp(bigTree, box.palette = "Blues", extra = 8)
confusionMatrix(predict(bigTree, type = "class"),
noRecodes$Status)
Confusion Matrix and Statistics
Reference
Prediction bad good
bad 480 182
good 769 3015
Accuracy : 0.7861
95% CI : (0.7737, 0.7981)
No Information Rate : 0.7191
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3821
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.3843
Specificity : 0.9431
Pos Pred Value : 0.7251
Neg Pred Value : 0.7968
Prevalence : 0.2809
Detection Rate : 0.1080
Detection Prevalence : 0.1489
Balanced Accuracy : 0.6637
'Positive' Class : bad
And we can mess around with a few different parameters. The “cp” argument is the complexity parameter (a split must improve the fit by at least this amount to be attempted) and minsplit is the minimum number of observations a node must contain before a split is attempted.
set.seed(10001)
noRecodes = creditScores %>%
dplyr::select(-ends_with("R"))
overgrownTree = rpart(Status ~ ., data = noRecodes,
method = "class",
control = rpart.control(cp = 0,
minsplit = 5))
rpart.plot::prp(overgrownTree, box.palette = "Blues", extra = 8)
confusionMatrix(predict(overgrownTree, type = "class"),
noRecodes$Status)
Confusion Matrix and Statistics
Reference
Prediction bad good
bad 1139 81
good 110 3116
Accuracy : 0.957
95% CI : (0.9507, 0.9628)
No Information Rate : 0.7191
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8929
Mcnemar's Test P-Value : 0.04276
Sensitivity : 0.9119
Specificity : 0.9747
Pos Pred Value : 0.9336
Neg Pred Value : 0.9659
Prevalence : 0.2809
Detection Rate : 0.2562
Detection Prevalence : 0.2744
Balanced Accuracy : 0.9433
'Positive' Class : bad
We can see what happens when we reduce the number of observations needed for a split and completely relax the need to improve our fit at every node – our tree is allowed to grow almost unchecked. Although our decision tree is now absolutely crazy and nearly impossible to follow, we have greatly improved our accuracy. Again, this in-sample accuracy is likely going to come at the cost of poor out-of-sample accuracy.
For the love of all that is holy, is this really necessary? Here is a big hint – if you have a limited set of predictors with clearly linear relationships that are bound by concrete theory, then a linear regression model will outperform these trees. If, on the other hand, you have more than a few predictors and no clear theory to guide variable selection, then the trees will be great.
Decision trees are clearly a lot of fun and have a wide variety of uses. While sometimes one tree is helpful, everything I know about horror movies leads me to believe that a single tree (perhaps on a desolate hill) always leads to some type of danger. If we remember the issue with trees being greedy and just splitting things how it finds it best, we can imagine what might happen if we only consider one tree.
This brings us to ensemble methods. If we think about what an ensemble is, we can start to guess what ensemble methods might be. There are many ensemble methods, but the one we are going to talk about now is random forests. Trees..forest…ensemble…are the connections starting to happen?
Forest is probably less pretentious than “tree ensemble”, but where does the “random” come into play? What would be the likely result of running the exact same tree 1000 times? Some minor differences aside, almost nothing. In our random forest, we will create randomness in 2 ways: by drawing a random sample of observations for each tree and by drawing a random sample of predictors for each tree.
library(randomForest)
set.seed(10001)
rfTest = randomForest(Status ~ ., data = noRecodes,
importance = TRUE,
ntree = 1000)
rfTest
Call:
randomForest(formula = Status ~ ., data = noRecodes, importance = TRUE, ntree = 1000)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 3
OOB estimate of error rate: 20.78%
Confusion matrix:
bad good class.error
bad 624 625 0.50040032
good 299 2898 0.09352518
Here we see some simple summary output, with our out-of-bag error rate and the same type of confusion matrix we saw earlier. You might be wondering what “out-of-bag” means, and rightfully so! Remember that each tree is grown on a random sample of the observations. Instead of wasting the unsampled data, the forest uses it as a testing set (i.e., the data that was not in the sample “bag” gets used to perform a very basic cross-validation).
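Here is a minimal base-R sketch of where the “bag” comes from – one tree’s bootstrap sample drawn with replacement, and the rows it never sees:

```r
# One tree's bootstrap sample (drawn with replacement); the rows that
# never land in the sample are "out of the bag" and act as a small test
# set for that tree. n is just an illustrative sample size.
set.seed(10001)
n = 1000
bag = sample(1:n, size = n, replace = TRUE)   # the bootstrap sample
oob = setdiff(1:n, bag)                       # rows the tree never saw

length(unique(bag)) / n   # roughly .63 of rows make it into the bag
length(oob) / n           # roughly .37 are out-of-bag
```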
We can also get the more elaborate confusion matrix from caret (note that the matrix is flipped from the previous output!):
confusionMatrix(rfTest$predicted, noRecodes$Status, positive = "good")
Confusion Matrix and Statistics
Reference
Prediction bad good
bad 624 299
good 625 2898
Accuracy : 0.7922
95% CI : (0.7799, 0.804)
No Information Rate : 0.7191
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4412
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9065
Specificity : 0.4996
Pos Pred Value : 0.8226
Neg Pred Value : 0.6761
Prevalence : 0.7191
Detection Rate : 0.6518
Detection Prevalence : 0.7924
Balanced Accuracy : 0.7030
'Positive' Class : good
Next, let’s look at our variable importance:
importance(rfTest)
bad good MeanDecreaseAccuracy MeanDecreaseGini
Seniority 45.1052118 32.213205 55.891401 177.68237
Home 18.3746102 23.034527 31.121596 84.76986
Time 9.2540182 21.972602 24.889852 59.91084
Age -0.3968928 24.139723 20.981706 128.09684
Marital 5.4525639 9.937768 12.344187 36.81541
Records 67.0224249 57.555206 81.899248 115.10560
Job 43.8688689 46.721606 62.004885 105.26650
Expenses 1.2742437 20.331422 18.725488 84.03249
Income 9.5274372 34.550613 40.144686 159.98766
Assets 32.5556150 32.230098 47.235940 112.50443
Debt 4.4347507 4.077064 6.054631 37.13761
Amount 12.8418494 31.957065 37.353952 138.63217
Price 3.9116821 25.000101 26.328744 150.57211
Finrat 25.3953746 35.631124 44.856744 195.33558
Savings 25.7414402 39.852088 52.361732 196.66277
varImpPlot(rfTest)
We can look at the “MeanDecreaseAccuracy” and the “MeanDecreaseGini” to get a sense of how important each variable might be. The help file for importance sheds light on how “MeanDecreaseAccuracy” is computed, but all you really need to know is that it tells us how poorly the model performs without the variable (variables with higher values are more important for prediction). Gini is typically used as an inequality measure, but it is conceptualized as node impurity here – a higher value means that removing the variable results in greater node impurity.
Now we know which variables are important for our random forest, so when do we interpret our trees from the forest!?! We can’t…so we don’t!!! Interpreting a single tree is statistical nonsense when we are dealing with a random forest. We are really only concerned with prediction at this point.
Outside of Breiman, that does not satisfy anybody.
# devtools::install_github('araastat/reprtree')
library(reprtree)
repTree = ReprTree(rfTest, noRecodes)
[1] "Constructing distance matrix..."
[1] "Finding representative trees..."
plot(repTree)
What happened? While we do not want to overgrow a single tree, a random forest does not care in the slightest. In fact, it will grow the deepest trees that it possibly can. If you would say that this causes over-fitting, then you would be absolutely correct. Remember, though, that we are generating a lot of trees and using all of those trees to vote on the outcomes. Therefore, the over-fitting within any one tree really does not make a big difference – it will come out in the wash. But you can see why interpreting a single tree from a forest is practically nonsense!
While interpreting a tree is indeed nonsense, we can do a bit more work exploring what happens within the tree for specific observations. This is certainly not requisite, thus the output of the code is not provided, but you could use the following code from the lime package to do some exploring on your own.
library(lime)
sampleObs = sample.int(n = nrow(noRecodes),
size = floor(.75 * nrow(noRecodes)), replace = F)
train = noRecodes[sampleObs, ]
test = noRecodes[-sampleObs, ]
modelTrain = train(Status ~ ., data = train, method = 'rf')
explainer = lime(train, modelTrain)
explanation = explain(test, explainer, n_labels = 2, n_features = 3)
head(explanation)
plot_features(explanation)
This is but one type of classification tree/random forest scenario. The party package has some different tree/forest algorithms. Let’s take a look at a random forest fit with conditional inference. Instead of using something like the Gini index to maximize information, conditional inference trees use significance testing to perform the splits.
library(party)
crfTest = cforest(Status ~ ., data = noRecodes,
controls = cforest_control(ntree = 1000))
Let’s see if we did any better:
confusionMatrix(predict(crfTest), noRecodes$Status, positive = "good")
Confusion Matrix and Statistics
Reference
Prediction bad good
bad 824 132
good 425 3065
Accuracy : 0.8747
95% CI : (0.8646, 0.8843)
No Information Rate : 0.7191
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.666
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9587
Specificity : 0.6597
Pos Pred Value : 0.8782
Neg Pred Value : 0.8619
Prevalence : 0.7191
Detection Rate : 0.6894
Detection Prevalence : 0.7850
Balanced Accuracy : 0.8092
'Positive' Class : good
We certainly did! This all boils down to tweaking the parameters of your forest. It is not terribly hard to make incremental improvements to your prediction rates, but sometimes you can make pretty significant jumps just by switching up the way the nodes are split.
We could get a sample tree from this forest, but we are not going to do it. One important thing to note is that party will not create a deep tree by default.
While provocative statements are often fun to make, you will not get one here. Instead, I would prefer to play it safe and assume that you have used some form of online commerce during the past year. Maybe you bought something off Amazon or you used Netflix to make poor choices around bedtime. If you have used any such services, you have interacted with a recommender system. Recommender systems are a class of models designed to make predictions about items that people might like/endorse/want, despite those people never having seen the items. It is exactly how Netflix makes recommendations (see how that works?) and how Amazon gives you items inspired by your viewing history.
Every time you click, watch, rate, or buy, you are helping to train these recommender systems. Imagine yourself as “Person A”. As “Person A”, you have rated the show Ash Vs. Evil Dead highly (and watched every available episode). Another person, likely “Person B”, has done the same. “Person B” also provided a very high rating for the movie Bubba Ho-tep. Since you and “Person B” both enjoyed Ash Vs. Evil Dead, a video service would likely recommend Bubba Ho-tep to you.
While our little contrived scenario serves as a clean example, recommender systems are great when there are many items (as is the case with nearly every online entity).
While we are going to use recommenderlab (for ease and speed), recosystem is an excellent package and one that you should keep in mind.
install.packages("recommenderlab")
install.packages("recosystem")
devtools::install_github("tarashnot/SlopeOne")
devtools::install_github("tarashnot/SVDApproximation")
Like most statistical techniques, recommender systems come in a veritable 31 flavors. However, your data and needs will largely dictate the precise method that is used.
Sometimes, items have a discrete set of descriptors and users can endorse those descriptors that they like. For example, you might like music that evokes memories of cruising in your Testarossa under pink and blue neon lights (possibly in Miami). There is a whole genre of music that fits this bill and is tagged appropriately on platforms like Pandora. When you offer a thumbs up or down, you are helping to train the system about what you like and which features of the music are important to you. If Pandora plays a song without synthesizers and I provide a thumbs down, the system knows that I might not like songs without synth. Next, I will likely give a thumbs up to something with synth. Finally, I will give a thumbs down to a track with synth and vocals. Pandora now knows that I put a large weight on the synth feature and will continue offering me content accordingly. Depending upon the system, I might run into an issue with diversity if I am too specific or all of my weight is on one feature.
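The feature-weighting idea above can be sketched as a simple dot product. Everything here – the track names, features, and weights – is invented purely for illustration:

```r
# Items described by binary features; rows are items, columns are features
items = rbind(trackA = c(synth = 1, vocals = 0),
              trackB = c(synth = 0, vocals = 1),
              trackC = c(synth = 1, vocals = 1))

# The thumbs up/down history has put most of this user's weight on synth
# and a mild negative weight on vocals
userProfile = c(synth = .9, vocals = -.4)

# Predicted preference is the dot product of item features and weights
scores = items %*% userProfile
scores   # trackA scores highest, so it gets recommended first
```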
The content-based approach works well when there is just a single content type to be offered (music, movies, news, etc.). However, it does not perform well when the content is mixed (how well do you think Pandora could recommend news to you given your musical preferences?). To that end, other techniques might be considered.
Fortunately, collaborative filtering tells us what is going on by name alone! Through a collaborative effort, items are filtered to you – people who act like you help to filter better predicted items to you. Our previous example about Ash Vs. Evil Dead is an excellent example of collaborative filtering.
Another interesting thing to note, though, is that we have two sources of information in a collaborative filtering model: the user and the item. The goal in both is the same, but the mechanism is slightly different.
In user-based collaborative filtering models, the goal is to find another user that shares similar patterns with the target user. At its root, it is very much akin to distance-based nearest-neighbor matching. In small systems, user-based models can work well. However, large and/or dynamic systems are not kind to user-based models. Computing a similarity between every pair of users in a large system is expensive, and this is compounded when users’ preferences change or update. Another issue – one that Amazon identified quickly – is that systems with many items but few ratings per item can perform poorly. To that end, some brilliant folks from Amazon decided to try something new.
Instead of focusing on similarities between users, item-based models (also called item-item models) focus on the mean ratings of the items. At a certain point, the mean rating of an item won’t bounce around too terribly much, and then similarities can be computed between items. If I purchase an item, there is a high probability that I would also enjoy an item that is similarly rated.
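Here is a toy sketch of the item-item idea: cosine similarity between item columns of a tiny, made-up ratings matrix (rows are users, NA means “not rated”):

```r
# A made-up user x item ratings matrix, purely for illustration
ratings = matrix(c(5,  4, NA,
                   4,  5,  1,
                   1,  2,  5,
                   NA, 1,  4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("user", 1:4),
                                 c("itemA", "itemB", "itemC")))

# Cosine similarity between two item columns, using only the users who
# rated both items
itemSim = function(i, j) {
  both = complete.cases(cbind(i, j))
  sum(i[both] * j[both]) / (sqrt(sum(i[both]^2)) * sqrt(sum(j[both]^2)))
}

itemSim(ratings[, "itemA"], ratings[, "itemB"])  # high: rated alike
itemSim(ratings[, "itemA"], ratings[, "itemC"])  # low: rated differently
```

A real system would compute this similarity for every item pair and recommend the nearest items to whatever you just rated or bought.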
What happens when you are brand new to anything that uses collaborative filtering? Can any reasonable predictions be made? Not really. This is an issue with data sparsity. Imagine a matrix with every person that has bought something from Amazon on the rows and every item from Amazon on the columns – this would be an incredibly large and sparse matrix. That is why some services ask you some questions from the get-go and Netflix throws some popular things out that might be of interest to you. Given the massive amounts of metadata that you produce (time of day, navigation patterns, devices used), many systems are never without some kind of context, but concrete behavior is very helpful.
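To get a feel for just how empty these matrices are, here is a toy illustration (the sizes and ratings-per-user count are invented; real systems are far larger and far sparser):

```r
# A modest user x item matrix where each user rates only 10 of 500 items
set.seed(1001)
nUsers = 1000
nItems = 500
ratings = matrix(NA, nUsers, nItems)
for (u in 1:nUsers) {
  rated = sample(nItems, 10)                      # 10 random items per user
  ratings[u, rated] = sample(1:5, 10, replace = TRUE)
}

mean(is.na(ratings))   # 0.98 – almost every cell is empty
```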
Another way to deal with sparsity is to perform what is frequently known as model-based collaborative filtering. It sounds fancy, but it really isn’t anything that you haven’t covered already! Imagine that we take our matrix and perform some type of dimension reduction on it.
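The dimension-reduction idea can be sketched with base R’s svd(). This uses a toy, fully-observed matrix just to show the mechanics; real model-based systems factor huge sparse matrices with specialized solvers:

```r
# A made-up 10 users x 5 items ratings matrix
set.seed(1001)
ratings = matrix(sample(1:5, 50, replace = TRUE), nrow = 10)

# Factor it and rebuild from only the first k singular values
decomp = svd(ratings)
k = 2                                # keep 2 latent "taste" dimensions
approx = decomp$u[, 1:k] %*% diag(decomp$d[1:k]) %*% t(decomp$v[, 1:k])

# The low-rank reconstruction supplies a predicted value for every cell,
# which is exactly what we want for the cells a user never rated
round(approx, 2)
```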
While your particular needs will mostly drive the decision, we do not always have to pick just one method and stick with it forever. Instead, we can create hybrid systems. These systems allow us to combine pieces of both of collaborative filtering and content-based models to achieve better predictions. If you were to come up with a video service where you allowed users to rate videos and show them videos with similar attributes to those that are rated highly (pulling from content-based filtering) and also take into account how a users viewing/searching habits are like other people (from collaborative filtering), you might just have a million dollar idea on your hands…I wonder why nobody has done such a thing…
It is very important to note that the example here is going to be very generic. You might wonder what the fun is in that and I would certainly wonder the same thing. As noted before, recommender systems are computationally expensive. You will notice that we are running our model on a very small set (20%) of the data – you want to actually be able to finish this model and see results. If your machine is a bit older, it might chug on the models.
We are going to demo both main types of the collaborative filtering model.
We can get the top recommendations for a few people given a trained model and predict ratings:
library(recommenderlab)
data("Jester5k")
ubcfTest = Recommender(Jester5k[1:1000], method = "UBCF")
ubcfRecom = predict(ubcfTest, Jester5k[1001:1005])
as(ubcfRecom, "list")
$u20089
[1] "j10" "j55" "j1" "j72" "j93" "j100" "j81" "j9" "j3" "j70"
$u11691
[1] "j89" "j98" "j93" "j91" "j92" "j73" "j76" "j100" "j88" "j74"
$u2646
[1] "j14" "j72" "j12" "j6" "j89" "j83" "j78" "j96" "j76" "j45"
$u8426
[1] "j89" "j100" "j98" "j94" "j97" "j90" "j72" "j88" "j93" "j84"
$u18250
[1] "j89" "j76" "j93" "j91" "j96" "j88" "j100" "j85" "j98" "j83"
ubcfPredictRating = predict(ubcfTest, Jester5k[1001:1005],
type = "ratings")
as(ubcfPredictRating, "list")
$u20089
j1 j2 j3 j4 j9 j10
-0.18606472 -1.69237808 -0.61481930 -2.39912551 -0.51535708 0.68832207
j22 j23 j24 j25 j30 j33
-2.04607082 -1.33763398 -1.38571982 -1.46730127 -1.21279006 -2.33520117
j37 j43 j44 j45 j47 j51
-2.43506322 -1.96736262 -1.07455739 -1.11121356 -1.07954026 -2.29830266
j52 j55 j57 j58 j59 j60
-2.29146789 -0.03947522 -3.12376136 -3.96257205 -2.44887694 -1.85488548
j63 j64 j67 j70 j71 j72
-2.07872950 -1.77801425 -2.11980871 -0.70987087 -2.58715566 -0.21825293
j73 j74 j75 j76 j77 j79
-1.52219624 -2.46801224 -2.39158643 -1.20860830 -2.49307746 -2.93104466
j80 j81 j82 j83 j84 j85
-1.38176987 -0.47794431 -3.38405423 -2.15290598 -0.89489739 -0.75140801
j86 j87 j89 j90 j91 j92
-2.87919467 -0.91506529 -0.89539157 -3.25045341 -1.74826816 -1.98510824
j93 j94 j95 j96 j97 j98
-0.27227971 -1.67751644 -1.08853718 -2.12969902 -1.17342305 -1.09561458
j99 j100
-2.05330591 -0.35511837
$u11691
j3 j4 j9 j24 j30 j33
-0.10997420 -0.62247064 -0.85521276 -2.09410293 -1.18183580 -2.19240862
j37 j41 j57 j58 j59 j60
-2.64312242 -1.60523436 -3.35176329 -3.00026607 -2.05958717 -2.89155860
j64 j71 j73 j74 j75 j76
-1.03516334 -0.50982149 0.63875251 0.37566534 0.31813915 0.61790196
j77 j78 j80 j81 j82 j83
0.04386737 0.03776578 0.18632927 -0.37789176 -0.32844826 0.09008047
j84 j85 j86 j87 j88 j89
-0.27971397 -1.09259219 -0.08705976 0.07758730 0.41660109 1.05256999
j90 j91 j92 j93 j94 j95
-0.99064777 0.74433759 0.64981841 0.94011006 -1.03811877 -0.25782565
j96 j97 j98 j99 j100
0.28986370 -0.62320836 0.97541566 -1.24293609 0.46506480
$u2646
j1 j2 j3 j4 j6 j9
-1.8158835 -1.8369514 -3.2861250 -3.0950636 -0.9599751 -3.0700649
j10 j11 j12 j14 j22 j24
-2.0200899 -2.3860338 -0.7247804 -0.3873996 -2.8043553 -2.4522703
j25 j30 j33 j37 j38 j40
-1.9776526 -2.0821196 -2.8677597 -3.1480525 -1.7307025 -2.0185173
j41 j43 j44 j45 j51 j55
-2.8900902 -2.9653842 -3.8222027 -1.5826495 -2.5325435 -2.0997833
j57 j58 j60 j63 j64 j67
-2.9362257 -3.6407675 -3.7075817 -3.2844756 -3.7008180 -2.3348311
j70 j71 j72 j73 j74 j75
-2.1795389 -3.2130440 -0.6115631 -1.9066992 -2.5006417 -2.6763121
j76 j77 j78 j79 j80 j81
-1.5779565 -1.8523654 -1.4364049 -1.9853804 -1.9916548 -2.3136842
j82 j83 j84 j85 j86 j87
-3.1278709 -1.3424441 -2.7792060 -2.3485366 -3.3417027 -1.6187393
j88 j89 j90 j91 j92 j93
-2.2341518 -1.0468588 -3.7990946 -2.5124673 -1.7996271 -1.6032134
j94 j96 j97 j98 j99 j100
-2.8486634 -1.5002727 -2.3749898 -2.4496628 -2.5890311 -2.9351828
$u8426
j71 j72 j73 j74 j75 j76 j77 j78
1.763926 2.844330 2.064189 1.945741 2.026668 2.618147 2.087283 2.305231
j79 j80 j81 j82 j83 j84 j85 j86
2.683440 2.717686 1.972459 2.226532 2.770031 2.815541 2.573262 2.108831
j87 j88 j89 j90 j91 j92 j93 j94
2.546303 2.843528 3.196710 2.857387 2.792360 2.773411 2.819905 3.078388
j95 j96 j97 j98 j99 j100
2.446361 2.526136 3.017407 3.093657 2.495765 3.123495
$u18250
j71 j72 j73 j74 j75 j76 j77 j78
5.947317 6.299860 5.994378 6.076744 5.847068 6.550990 5.864459 6.174978
j79 j80 j81 j82 j83 j84 j85 j86
6.100444 6.021890 6.308158 6.119615 6.331735 5.893564 6.369233 6.118635
j88 j89 j90 j91 j93 j94 j95 j96
6.375128 6.605151 6.071887 6.397380 6.528446 5.913060 6.258110 6.382495
j97 j98 j99 j100
5.853568 6.337283 6.093690 6.371520
ibcfTest = Recommender(Jester5k[1:1000], method = "IBCF")
ibcfRecom = predict(ibcfTest, Jester5k[1001:1005])
as(ibcfRecom, "list")
$u20089
[1] "j97" "j90" "j70" "j30" "j86" "j59" "j63" "j94" "j82" "j71"
$u11691
[1] "j33" "j37" "j59" "j97" "j64" "j57" "j41" "j86" "j95" "j94"
$u2646
[1] "j88" "j76" "j96" "j92" "j94" "j93" "j78" "j55" "j30" "j84"
$u8426
[1] "j79" "j84" "j86" "j71" "j74" "j75" "j77" "j95" "j94" "j98"
$u18250
[1] "j84" "j82" "j94" "j71" "j95" "j85" "j86" "j99" "j90" "j77"
ibcfPredictRating = predict(ibcfTest, Jester5k[1001:1005],
type = "ratings")
as(ibcfPredictRating, "list")
$u20089
j1 j2 j3 j4 j9 j10
-3.6205321 -7.5229129 -1.4391137 -1.3889552 -0.6189486 -5.6755132
j22 j23 j24 j25 j30 j33
-2.5407151 -2.9002208 -0.7644542 -2.9160664 1.6783319 -1.9385028
j37 j43 j44 j45 j47 j51
-0.6601058 -0.6508645 -0.5383921 -0.8501978 -2.7539340 -2.6863429
j52 j55 j57 j58 j59 j60
-1.7733936 -2.3186280 -0.6461654 -1.0724435 0.4323534 -0.9114580
j63 j64 j67 j70 j71 j72
0.4147624 -0.1481822 -0.5985703 1.7908505 0.2509258 -0.8960271
j73 j74 j75 j76 j77 j79
-1.2345176 -0.9094863 -0.3830234 -2.5212052 -0.8324549 -0.7450601
j80 j81 j82 j83 j84 j85
-1.4661141 -1.1085947 0.2600464 -2.7033535 -1.5685415 -3.2818597
j86 j87 j89 j90 j91 j92
1.1709215 -4.8808445 -0.4856941 2.1166004 -0.6167418 -4.4546985
j93 j94 j95 j96 j97 j98
-1.5267968 0.4114216 -1.7620931 -2.4643837 2.9106994 -2.6079573
j99 j100
-1.0630701 -1.0080002
$u11691
j3 j4 j9 j24 j30 j33
0.32640060 1.33685601 0.27681377 1.11634329 0.79762786 2.54928232
j37 j41 j57 j58 j59 j60
2.03821443 1.61470402 1.71655653 1.16267661 1.86224492 1.40970408
j64 j71 j73 j74 j75 j76
1.83293676 0.68732601 -0.81125146 1.15857126 -0.09910178 -1.15389516
j77 j78 j80 j81 j82 j83
0.10436067 -1.90398599 -1.11375220 0.70627679 0.95424946 -0.86125941
j84 j85 j86 j87 j88 j89
0.74540963 -0.07566627 1.59732233 -1.23424746 0.44761557 0.65275203
j90 j91 j92 j93 j94 j95
0.71269923 -0.44761401 -1.10114635 -0.55724121 1.41071449 1.45644584
j96 j97 j98 j99 j100
1.25910972 1.83589555 0.33375841 1.08776853 -0.94569395
$u2646
j1 j2 j3 j4 j6 j9
-3.7643646 -2.9512010 -1.5129619 -1.8988503 -4.6633212 -2.6120816
j10 j11 j12 j14 j22 j24
-1.8505295 -3.4667847 -2.2063839 -2.4884973 -1.4910869 -1.6602673
j25 j30 j33 j37 j38 j40
-2.0636546 -0.4758708 -2.0368016 -3.3075357 -1.3970700 -3.9081927
j41 j43 j44 j45 j51 j55
-1.8286770 -2.1299526 -2.5018919 -1.2410665 -2.9227329 -0.3607317
j57 j58 j60 j63 j64 j67
-2.4147751 -2.1266107 -1.3062268 -2.7773672 -2.0002061 -2.5906979
j70 j71 j72 j73 j74 j75
-2.6119790 -3.0746560 -2.1468924 -3.0213681 -2.2740970 -1.9417306
j76 j77 j78 j79 j80 j81
0.9866539 -1.9556999 -0.3257765 -0.7807587 -5.2186093 -1.5808528
j82 j83 j84 j85 j86 j87
-1.7458272 -3.7245876 -0.6615445 -2.8486741 -1.3416623 -1.9940062
j88 j89 j90 j91 j92 j93
1.4857077 -2.0627184 -1.7362316 -2.9726296 0.1470699 -0.1996794
j94 j96 j97 j98 j99 j100
0.1097609 0.4660366 -2.3281891 -1.2608227 -1.8464379 -3.0624484
$u8426
j71 j72 j73 j74 j75 j76
4.48489677 0.97993006 0.17113622 3.97988881 3.97526304 1.66718290
j77 j78 j79 j80 j81 j82
3.92868596 0.79615452 4.54818081 0.47094448 1.41146733 3.15797523
j83 j84 j85 j86 j87 j88
-0.31840282 4.54551424 2.73270388 4.53077763 0.04227635 -0.41842378
j89 j90 j91 j92 j93 j94
2.80179002 2.87175836 0.86651171 0.05013727 2.01313958 3.44676954
j95 j96 j97 j98 j99 j100
3.75661190 1.61788235 3.17673038 3.36024303 3.27607787 2.52676938
$u18250
j71 j72 j73 j74 j75 j76 j77 j78
7.472495 5.684735 4.356324 6.802610 6.861523 5.663692 6.954942 5.545492
j79 j80 j81 j82 j83 j84 j85 j86
6.280012 4.867291 6.126106 8.071872 4.545116 8.113706 7.280704 7.198826
j88 j89 j90 j91 j93 j94 j95 j96
5.335152 5.902523 6.992913 5.652798 6.224675 7.582607 7.328307 6.075204
j97 j98 j99 j100
6.718550 6.437228 7.010913 6.101415
Compare the results of our two models. How were they different? Did the same jokes get predicted? You also have the power to find the recommended jokes (you can index into JesterJokes with the given number).
Here is u18250’s top recommended joke:
JesterJokes[84]
j84
"Q: What is the difference between Mechanical Engineers and Civil Engineers? A: Mechanical Engineers build weapons, Civil Engineers build targets."
What a knee-slapper!
For a recommender system, inference practically flies out the window. We are going to focus our attention on prediction quality. As with other matters of prediction, the classic root mean squared error will serve as our measure of quality.
First we will create our training/testing split.
e = evaluationScheme(Jester5k[1:1000], method = "split",
train = 0.9, given = 15, goodRating = 5)
e
Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.900
Good ratings: >=5.000000
Data set: 1000 x 100 rating matrix of class 'realRatingMatrix' with 72358 ratings.
We can now train a user-based model:
ubcfTrain = Recommender(getData(e, "train"), "UBCF")
And an item-based model:
ibcfTrain = Recommender(getData(e, "train"), "IBCF")
Now we can make our predictions:
ubcfPredict = predict(ubcfTrain, getData(e, "known"), type="ratings")
ibcfPredict = predict(ibcfTrain, getData(e, "known"), type="ratings")
You can explore these just like we did earlier.
Finally, we can take a look at our errors:
rbind(UBCF = calcPredictionAccuracy(ubcfPredict, getData(e, "unknown")),
IBCF = calcPredictionAccuracy(ibcfPredict, getData(e, "unknown")))
RMSE MSE MAE
UBCF 4.705376 22.14056 3.697923
IBCF 5.343162 28.54938 4.141209
Since we want as little error as possible, we can conclude that the user-based collaborative filter performed better on root mean squared error, mean squared error, and mean absolute error.
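For reference, these three error metrics are just simple functions of the residuals between predicted and observed ratings. Here is a minimal base-R sketch with made-up rating vectors (not taken from the Jester data):

```r
# Hypothetical predicted and observed ratings for four user-item pairs
predicted = c(2.5, -1.0, 4.0, 0.5)
observed = c(3.0, -2.0, 3.5, 1.5)

err = predicted - observed # residuals
mse = mean(err^2)          # mean squared error
rmse = sqrt(mse)           # root mean squared error
mae = mean(abs(err))       # mean absolute error

round(c(RMSE = rmse, MSE = mse, MAE = mae), 3)
```

This mirrors what calcPredictionAccuracy reports, just computed here by hand over a toy set of held-out ratings.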
Words are everywhere. Believe it or not, you are reading words right now! Given our penchant for taking things and making numbers out of them, you are probably already guessing that we can somehow make words tell a story with numbers. If that is what you are guessing, then you are absolutely correct.
Before we can even begin to dive into analyzing text, we must first process the text. Processing text involves several steps that will be combined in various ways, depending on what we are trying to accomplish.
Tense aside, are jumped, jump, and jumping the same thing? Conceptually, yes, but what if we compare the actual strings? As strings, are they the same? No. We have strings of 6, 4, and 7 characters, respectively.
What if we remove the suffixes "ed" and "ing"? We are left with three instances of "jump" – now we have something that is equivalent both in meaning and as a string. This is the goal of stemming.
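A deliberately naive base-R sketch of that idea, using a single suffix-stripping rule (real stemmers, like the Porter stemmer we use below, are far more sophisticated):

```r
words = c("jumped", "jump", "jumping")

nchar(words) # 6, 4, and 7 characters -- unequal as strings

# Strip a trailing "ed" or "ing"; a toy stand-in for real stemming
naiveStem = sub("(ed|ing)$", "", words)
naiveStem # "jump" "jump" "jump"
```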
Let’s take a look to see how this works (you will need to install tm and SnowballC first):
jumpingStrings = c("jump", "jumping", "jumped", "jumper")
tm::stemDocument(jumpingStrings)
[1] "jump"   "jump"   "jump"   "jumper"
We got exactly what we expected, right? You might have noticed that “jumper” did not get stemmed. Do you have any idea why? Let’s think through it together. “Jump”, “jumping”, and “jumped” are all verbs related to the act of jumping. “Jumper”, on the other hand, is a person who jumps – it is a noun. Martin Porter’s stemming algorithm works incredibly well!
Hopefully, this makes conceptual sense; however, we also need to understand why we do it. In a great many text-based methods, we are going to create a matrix that keeps track of every term (i.e., word) in every document – this is known as a document-term matrix. If we know that “jump”, “jumping”, and “jumped” all refer to the same thing, we want that thing represented just once within our document-term matrix.
Shall we take a look?
library(tm)
documents = c("I like to jump.",
"I have jumped my whole life.",
"Jumping is in my blood.",
"I am a jumper.")
documentsCorp = tm::SimpleCorpus(VectorSource(documents))
documentsDTM = DocumentTermMatrix(documentsCorp)
inspect(documentsDTM)
<<DocumentTermMatrix (documents: 4, terms: 9)>>
Non-/sparse entries: 9/27
Sparsity : 75%
Maximal term length: 7
Weighting : term frequency (tf)
Sample :
Terms
Docs blood have jump jumped jumper jumping life like whole
1 0 0 1 0 0 0 0 1 0
2 0 1 0 1 0 0 1 0 1
3 1 0 0 0 0 1 0 0 0
4 0 0 0 0 1 0 0 0 0
We can see that without stemming, we have 9 terms (things like “I”, “a”, and “to” get removed automatically). Let’s do some stemming now:
documentsStemmed = stemDocument(documents)
documentsStemmed
[1] "I like to jump." "I have jump my whole life."
[3] "Jump is in my blood." "I am a jumper."
And now the document-term matrix:
stemmedDocCorp = tm::SimpleCorpus(VectorSource(documentsStemmed))
stemmedDocDTM = DocumentTermMatrix(stemmedDocCorp)
inspect(stemmedDocDTM)
<<DocumentTermMatrix (documents: 4, terms: 7)>>
Non-/sparse entries: 9/19
Sparsity : 68%
Maximal term length: 6
Weighting : term frequency (tf)
Sample :
Terms
Docs blood have jump jumper life like whole
1 0 0 1 0 0 1 0
2 0 1 1 0 1 0 1
3 1 0 1 0 0 0 0
4 0 0 0 1 0 0 0
If we are trying to find documents that cover similar content, this document-term matrix will help us draw better conclusions: the first three documents are all talking about the act of jumping, and the stemmed matrix reflects that.
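To make “covering similar content” concrete, a common move is to compute cosine similarity between rows of the document-term matrix. Here is a base-R sketch on a tiny hand-built matrix in the spirit of the stemmed one above (the counts are illustrative, not tm output):

```r
# Rows are documents, columns are stemmed terms
dtm = rbind(d1 = c(blood = 0, jump = 1, jumper = 0, life = 0, like = 1, whole = 0),
            d2 = c(0, 1, 0, 1, 0, 1),
            d3 = c(1, 1, 0, 0, 0, 0),
            d4 = c(0, 0, 1, 0, 0, 0))

# Cosine similarity: dot product scaled by the vector norms
cosineSim = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosineSim(dtm["d1", ], dtm["d3", ]) # both mention "jump": 0.5
cosineSim(dtm["d1", ], dtm["d4", ]) # no shared terms: 0
```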
This is going to be especially important for next week’s topic: topic models.
Some words do us very little good: articles, prepositions, and very high frequency words. These are all words that need to be removed. Fortunately, you don’t have to do this on your own – a great many dictionaries exist that contain words ready for removal.
tm::stopwords("en")
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
[21] "herself" "it" "its" "itself" "they"
[26] "them" "their" "theirs" "themselves" "what"
[31] "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being"
[46] "have" "has" "had" "having" "do"
[51] "does" "did" "doing" "would" "should"
[56] "could" "ought" "i'm" "you're" "he's"
[61] "she's" "it's" "we're" "they're" "i've"
[66] "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll"
[76] "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
[86] "haven't" "hadn't" "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
[96] "cannot" "couldn't" "mustn't" "let's" "that's"
[101] "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an"
[111] "the" "and" "but" "if" "or"
[116] "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about"
[126] "against" "between" "into" "through" "during"
[131] "before" "after" "above" "below" "to"
[136] "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again"
[146] "further" "then" "once" "here" "there"
[151] "when" "where" "why" "how" "all"
[156] "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no"
[166] "nor" "not" "only" "own" "same"
[171] "so" "than" "too" "very"
Removing stopwords takes little effort!
documents = c("I like to jump.",
"I have jumped my whole life.",
"Jumping is in my blood.",
"I am a jumper.")
tm::removeWords(documents, words = stopwords("en"))
[1] "I like jump." "I jumped whole life."
[3] "Jumping blood." "I jumper."
We can even include custom stopwords:
tm::removeWords(documents, words = c("blood", stopwords("en")))
[1] "I like jump." "I jumped whole life."
[3] "Jumping ." "I jumper."
Stemming and stopword removal will go a long way to helping us process and clean our text. There are times, however, when you need complete control over what you remove. You might need to remove punctuation and other symbols, in addition to a wide array of things that come up in typical analyses (emoticons and wacky encoding characters!). When those cases pop up, you might need to make use of regular expressions to help. You should have already learned about regular expressions, but be prepared to circle back to them constantly when engaging in any matter of data science shenanigans!
Many text processors will deal with hyphenated words. If, however, your data is coming from the web, then what looks like a hyphen might not really be a hyphen!
exampleStrings = c("This is a regularly-occuring hyphen",
"This is something that more closely--resembles a manually-produced em dash",
"This is something that came from the darkest recesses of ill—conceived web design")
We can turn to regular expressions to see what we can do:
gsub(pattern = "-", replacement = " ", exampleStrings)
[1] "This is a regularly occuring hyphen"
[2] "This is something that more closely resembles a manually produced em dash"
[3] "This is something that came from the darkest recesses of ill—conceived web design"
Using gsub and just specifying a regular “-” will handle just about everything, except for the character in the last sentence. What is it? It looks like a hyphen, but it is not – it is an em dash. To remove it, we will need to get creative:
gsub(pattern = "(?!\\w)\\S", " ", exampleStrings, perl = TRUE)
[1] "This is a regularly occuring hyphen"
[2] "This is something that more closely resembles a manually produced em dash"
[3] "This is something that came from the darkest recesses of ill conceived web design"
By using some pattern matching, we are able to strip out everything between words, no matter what it might be. This particular bit of regex is using what is known as a negative lookahead. We have our main expression, the non-whitespace character, \S. In front of it, we put a lookahead expression (the chunk in parentheses that starts with “?!”). Reading the whole pattern, we would say, “match any non-whitespace character that is not a word character.”
While this is certainly a contrived example and would need additional robustness testing, it goes to show you that working with interesting data will yield interesting problems. I have been telling you all along that nothing is ever as it seems.
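An alternative sketch, if the lookahead feels opaque: list the hyphen-like characters explicitly in a character class. The Unicode escapes below are assumptions about which dash characters are lurking in the data (the ASCII hyphen, the en dash \u2013, and the em dash \u2014):

```r
strings = c("a regularly-occurring hyphen",
            "closely--resembles an em dash",
            "ill\u2014conceived web design") # \u2014 is a true em dash

# Collapse runs of hyphen-like characters into a single space
gsub("[-\u2013\u2014]+", " ", strings)
```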
There are several R packages that will help us process text. The tm package is popular and automates most of our work. You already saw how we use the stemming and stopword removal functions, but tm is full of fun stuff and allows for one-pass text processing.
documents = c("I like to jump.",
"I have jumped my whole life.",
"Jumping is in my blood.",
"I am a jumper.")
documentCorp = SimpleCorpus(VectorSource(documents))
stopWordRemoval = function(x) {
removeWords(x, stopwords("en"))
}
textPrepFunctions = list(removePunctuation,
stemDocument,
stopWordRemoval,
removeNumbers,
stripWhitespace)
tm_map(documentCorp, FUN = tm_reduce, tmFuns = textPrepFunctions)
Once you get your text tidied up (or even before), you can produce some visualizations!
library(dplyr)
library(tidytext)
library(wordcloud2)
txExecutions = read.csv("txEx.csv", stringsAsFactors = FALSE)
txExecutions %>%
dplyr::select(correctedStatements, inmateNumber) %>%
unnest_tokens(word, correctedStatements) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(n > 25) %>%
na.omit() %>%
wordcloud2(shape = "cardioid")
Sentiment analysis is commonly used when we want to know the general feelings of what someone has written or said. It is most often seen applied to Twitter and other social media posts, but we can use it anywhere people have written or said something (product reviews, song lyrics, final statements).
Sentiment can take many different forms: positive/negative affect, emotional states, and even financial contexts.
Let’s take a peek at some simple sentiment analysis.
Let’s consider the following statements:
library(tidytext)
statement = "I dislike programming, but I really love R."
tokens = data_frame(text = statement) %>%
unnest_tokens(tbl = ., output = word, input = text)
tokens
# A tibble: 8 x 1
word
<chr>
1 i
2 dislike
3 programming
4 but
5 i
6 really
7 love
8 r
From there, we get every individual token. This brings us to a quick, but necessary, detour. When we are talking about text, tokens are simply elements that have some general meaning. We typically associate tokens with individual words, but we could go deeper than that (spaces, punctuation, n-grams). While we won’t dive into them too deeply, n-grams are also interesting. Just like our typical notion of n, an n-gram is an n-length group of words. We can set n to be anything, but we would typically look at 2-gram and 3-gram chunks.
tokenizers::tokenize_ngrams(statement, n = 2)[[1]]
[1] "i dislike" "dislike programming" "programming but"
[4] "but i" "i really" "really love"
[7] "love r"
These are helpful for looking at frequently occurring combinations of words. They are also going to be helpful for us in just a bit.
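The bigram idea can also be sketched in base R, without the tokenizers package, by pairing each word with its successor:

```r
statement = "I dislike programming, but I really love R."

# Lowercase, drop punctuation, split on whitespace
words = strsplit(tolower(gsub("[[:punct:]]", "", statement)), "\\s+")[[1]]

# Pair each word with the one that follows it
bigrams = paste(head(words, -1), tail(words, -1))
bigrams
```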
Now back to the matter at hand.
We can compare the tokens within our statement to some pre-defined dictionary of positive and negative words.
library(tidyr)
tokens %>%
inner_join(get_sentiments("bing")) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
# A tibble: 1 x 3
negative positive sentiment
<dbl> <dbl> <dbl>
1 1 1 0
When we use the Bing dictionary, we see that we get one positive word (love) and one negative word (dislike), for a neutral overall sentiment (a sentiment of 0 indicates neutrality, while anything above 0 has an increasing amount of positivity and anything below 0 has an increasing amount of negativity).
Do you think that dislike and love are of the same magnitude? If I had to make a wild guess, I might say that love is stronger than dislike. Let’s switch out our sentiment library to get something with a better notion of polarity magnitude.
tokens %>%
inner_join(get_sentiments("afinn"))
# A tibble: 2 x 2
word score
<chr> <int>
1 dislike -2
2 love 3
Now this looks a bit more interesting! “Love” has a stronger positive polarity than “dislike” has negative polarity. So, we could guess that we would have some positive sentiment.
If we divide the sum of our word sentiments by the number of words matched in the dictionary, we should get an idea of our sentence’s overall sentiment.
tokens %>%
inner_join(get_sentiments("afinn")) %>%
summarize(n = nrow(.), sentSum = sum(score)) %>%
mutate(sentiment = sentSum / n)
# A tibble: 1 x 3
n sentSum sentiment
<int> <int> <dbl>
1 2 1 0.5
Our sentiment of .5 tells us that our sentence is positive, even if only slightly so.
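The whole pipeline above condenses to a few lines of base R. The two-word dictionary here simply mirrors the AFINN scores we just saw; a real analysis would use the full lexicon:

```r
statement = "I dislike programming, but I really love R."
words = strsplit(tolower(gsub("[[:punct:]]", "", statement)), "\\s+")[[1]]

# Toy AFINN-style dictionary (scores taken from the output above)
sentimentDict = c(dislike = -2, love = 3)

# Keep only the tokens with a dictionary entry and average their scores
scores = sentimentDict[words[words %in% names(sentimentDict)]]
mean(scores) # (-2 + 3) / 2 = 0.5
```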
While these simple sentiment analyses provide some decent measures to the sentiment of our text, we are ignoring big chunks of our text by just counting keywords.
For example, it is probably fair to say that “really love” is stronger than just “love”. We might want to switch over to some techniques that consider n-grams and other text features to calculate sentiment.
When we use sentiment analysis that is aware of context, valence (“love” is stronger than “like”), modifiers (e.g., “really love”), and adversative statements (“but,…”, “however,…”), we get a better idea about the real sentiment of the text.
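To make that contrast concrete, here is a deliberately simplified sketch of amplifier handling: when the token before a dictionary hit is an amplifier, the hit’s score gets multiplied up. The 1.5 multiplier and the tiny dictionaries are invented for illustration; sentimentr’s actual weighting scheme is considerably more elaborate:

```r
statement = "I dislike programming, but I really love R."
words = strsplit(tolower(gsub("[[:punct:]]", "", statement)), "\\s+")[[1]]

sentimentDict = c(dislike = -2, love = 3) # toy polarity dictionary
amplifiers = c("really", "very")          # toy valence shifters

scoreWord = function(i) {
  base = sentimentDict[[words[i]]]
  # Multiply the score when the preceding token is an amplifier
  if (i > 1 && words[i - 1] %in% amplifiers) base * 1.5 else base
}

hits = which(words %in% names(sentimentDict))
sapply(hits, scoreWord) # dislike stays -2; "really love" becomes 4.5
```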
We will use the jockers sentiment library, but many more available. Depending on your exact needs, there are some dictionaries designed for different applications. A prime example, and one that should be near to our hearts, is the Loughran and McDonald dictionary for financial documents (e.g., SEC 10K filings). Tim Loughran and Bill McDonald are superstars in the Finance Department at Notre Dame! There are dictionaries designed to measure certain attitudes and opinions (e.g., disgust, excitedness, sadness) and even dictionaries to measure emoji sentiment.
Before we engage in our whole sentiment analysis, let’s take a look at a few things.
Here is the dictionary that jockers will use.
lexicon::hash_sentiment_jockers
x y
1: abandon -0.75
2: abandoned -0.50
3: abandoner -0.25
4: abandonment -0.25
5: abandons -1.00
---
10734: zealous 0.40
10735: zenith 0.40
10736: zest 0.50
10737: zombie -0.25
10738: zombies -0.25
You might want to use View() to get a complete look at what is happening in there.
We should also take a peek at our valence shifters:
lexicon::hash_valence_shifters
x y
1: absolutely 2
2: acute 2
3: acutely 2
4: ain't 1
5: aint 1
---
136: whereas 4
137: won't 1
138: wont 1
139: wouldn't 1
140: wouldnt 1
With all of that out of the way, let’s get down to the matter at hand:
library(sentimentr); library(lexicon); library(magrittr)
statement = "I dislike programming, but I really love R."
sentiment(statement, polarity_dt = lexicon::hash_sentiment_jockers)
element_id sentence_id word_count sentiment
1: 1 1 8 0.516188
We can see that we get a much stronger sentiment score when we include more information within the sentence. While the first part of our sentence starts out with a negative word (dislike has a sentiment value of -1.6), we have an adversative “but” that downweights whatever is in the initial phrase, and then we have the amplified (from “really”) sentiment of “love” (with a weight of 3.2 in our dictionary).
With all of this together, we get a much better idea about the sentiment of our text.
While the text that we have used so far serves its purpose as an example quite well, we can always take a look at other written words.
Let’s consider the following data:
txExecutionsShort = txExecutions %>%
dplyr::select(inmateNumber, correctedStatements) %>%
filter(inmateNumber == 529|
inmateNumber == 843|
inmateNumber == 714|
inmateNumber == 999253)
sentiment(get_sentences(txExecutionsShort),
polarity_dt = lexicon::hash_sentiment_jockers) %>%
group_by(inmateNumber) %>%
summarize(meanSentiment = mean(sentiment))
# A tibble: 4 x 2
inmateNumber meanSentiment
<int> <dbl>
1 529 0.0972
2 714 0.0763
3 843 0.561
4 999253 0.0121
Sentiment analysis is always a handy tool to have around. You might also want to explore other descriptive aspects of your text.
The koRpus package allows for all sorts of interesting descriptives. There are a great number of readability and lexical diversity statistics (Fucks is likely my favorite).
We need to tokenize our text in a manner that will please koRpus.
library(koRpus)
statementTokens = lapply(txExecutionsShort$correctedStatements, function(x) tokenize(x,
format = "obj", lang = "en"))
statementTokens
[[1]]
token tag lemma lttr wclass
1 What word.kRp 4 word
2 is word.kRp 2 word
3 about word.kRp 5 word
4 to word.kRp 2 word
5 transpire word.kRp 9 word
6 in word.kRp 2 word
[...]
172 well word.kRp 4 word
173 by word.kRp 2 word
174 all word.kRp 3 word
175 T.D.C. abbr.kRp 6 abbreviation
176 personnel word.kRp 9 word
177 . .kRp 1 fullstop
desc stop stem
1 Word (kRp internal) <NA> <NA>
2 Word (kRp internal) <NA> <NA>
3 Word (kRp internal) <NA> <NA>
4 Word (kRp internal) <NA> <NA>
5 Word (kRp internal) <NA> <NA>
6 Word (kRp internal) <NA> <NA>
172 Word (kRp internal) <NA> <NA>
173 Word (kRp internal) <NA> <NA>
174 Word (kRp internal) <NA> <NA>
175 Abbreviation (kRp internal) <NA> <NA>
176 Word (kRp internal) <NA> <NA>
177 Sentence ending punctuation (kRp internal) <NA> <NA>
[[2]]
token tag lemma lttr wclass
1 I word.kRp 1 word
2 ' ''kRp 1 punctuation
3 m word.kRp 1 word
4 sorry word.kRp 5 word
5 for word.kRp 3 word
6 what word.kRp 4 word
[...]
14 this word.kRp 4 word
15 . .kRp 1 fullstop
16 Jesus word.kRp 5 word
17 forgive word.kRp 7 word
18 me word.kRp 2 word
19 . .kRp 1 fullstop
desc stop stem
1 Word (kRp internal) <NA> <NA>
2 Quote (kRp internal) <NA> <NA>
3 Word (kRp internal) <NA> <NA>
4 Word (kRp internal) <NA> <NA>
5 Word (kRp internal) <NA> <NA>
6 Word (kRp internal) <NA> <NA>
14 Word (kRp internal) <NA> <NA>
15 Sentence ending punctuation (kRp internal) <NA> <NA>
16 Word (kRp internal) <NA> <NA>
17 Word (kRp internal) <NA> <NA>
18 Word (kRp internal) <NA> <NA>
19 Sentence ending punctuation (kRp internal) <NA> <NA>
[[3]]
token tag lemma lttr wclass
1 I word.kRp 1 word
2 am word.kRp 2 word
3 so word.kRp 2 word
4 glad word.kRp 4 word
5 I word.kRp 1 word
6 found word.kRp 5 word
[...]
10 am word.kRp 2 word
11 so word.kRp 2 word
12 happy word.kRp 5 word
13 for word.kRp 3 word
14 it word.kRp 2 word
15 . .kRp 1 fullstop
desc stop stem
1 Word (kRp internal) <NA> <NA>
2 Word (kRp internal) <NA> <NA>
3 Word (kRp internal) <NA> <NA>
4 Word (kRp internal) <NA> <NA>
5 Word (kRp internal) <NA> <NA>
6 Word (kRp internal) <NA> <NA>
10 Word (kRp internal) <NA> <NA>
11 Word (kRp internal) <NA> <NA>
12 Word (kRp internal) <NA> <NA>
13 Word (kRp internal) <NA> <NA>
14 Word (kRp internal) <NA> <NA>
15 Sentence ending punctuation (kRp internal) <NA> <NA>
[[4]]
token tag lemma lttr wclass
1 Yes word.kRp 3 word
2 I word.kRp 1 word
3 do word.kRp 2 word
4 . .kRp 1 fullstop
5 There word.kRp 5 word
6 has word.kRp 3 word
[...]
389 I word.kRp 1 word
390 ' ''kRp 1 punctuation
391 m word.kRp 1 word
392 finished word.kRp 8 word
393 talking word.kRp 7 word
394 . .kRp 1 fullstop
desc stop stem
1 Word (kRp internal) <NA> <NA>
2 Word (kRp internal) <NA> <NA>
3 Word (kRp internal) <NA> <NA>
4 Sentence ending punctuation (kRp internal) <NA> <NA>
5 Word (kRp internal) <NA> <NA>
6 Word (kRp internal) <NA> <NA>
389 Word (kRp internal) <NA> <NA>
390 Quote (kRp internal) <NA> <NA>
391 Word (kRp internal) <NA> <NA>
392 Word (kRp internal) <NA> <NA>
393 Word (kRp internal) <NA> <NA>
394 Sentence ending punctuation (kRp internal) <NA> <NA>
lapply(statementTokens, function(x) readability(x, index = "Flesch.Kincaid", quiet = TRUE))
[[1]]
Flesch-Kincaid Grade Level
Parameters: default
Grade: 5.83
Age: 10.83
Text language: en
[[2]]
Flesch-Kincaid Grade Level
Parameters: default
Grade: 0.56
Age: 5.56
Text language: en
[[3]]
Flesch-Kincaid Grade Level
Parameters: default
Grade: 2.51
Age: 7.51
Text language: en
[[4]]
Flesch-Kincaid Grade Level
Parameters: default
Grade: 1.45
Age: 6.45
Text language: en
We already know that text is everywhere, and our sentiment analysis was a reasonable crack at extracting some meaning from text. A common line of inquiry relates to what is expressed within the text. In the traditional social sciences, this was done through some form of content analysis – a coding scheme was developed by researchers, people read each comment and assigned it a code, some agreement statistics were computed, and any discrepancies were hashed out by the researchers. When this was the only option, it was serviceable. Could you do this for thousands of tweets? With time, maybe. Could you do this with a few million articles? No. This is where topic models, and latent Dirichlet allocation in particular, come to save the day.
What follows borrows heavily from documents produced by Michael Clark and from a collaboration between Michael Clark and Seth Berry.
We will start with a brief demonstration of a standard latent dirichlet allocation (LDA) for topic modeling. The basic idea is to first generate some documents based on the underlying model, and then we’ll use the topicmodels package to recover the topics via LDA.
Suffice it to say, one can approach this in (at least) one of two ways. In one sense, LDA is a dimension reduction technique, much like the family of techniques that includes PCA, factor analysis, non-negative matrix factorization, etc. We will take a whole lot of terms, loosely defined, and boil them down to a few topics. In this sense, LDA is akin to discrete PCA. Another way to think about this is more from the perspective of factor analysis, where we are keenly interested in interpretation of the result, and want to know both what terms are associated with which topics, and what documents are more likely to present which topics. The following is the plate diagram and description for standard LDA from Blei, Jordan, and Ng (2003).
Both z and w are from a multinomial draw based on the θ and β distributions respectively. The key idea is that, to produce a given document, one draws a topic, and given the topic, words are drawn from it.
In the standard setting, to be able to conduct such an analysis from text, one needs a document-term matrix, where rows represent documents, and columns terms. Terms are typically words but could be any n-gram of interest. In practice, this is where you’ll spend most of your time, as text is never ready for analysis, and must be scraped, converted, stemmed, cleaned etc. We will initially create θ and β noted above, then given those, draw topics and words given topics based on the multinomial distribution.
We begin the simulation by creating topic probabilities. There will be k = 3 topics. Half of our documents will have topic probabilities (\(\theta_1\)) which will be notably different from the other half (\(\theta_2\)). Specifically, the first half will show higher probability of topics “A” and “B”, while the second half of documents will show higher probability of topic “C”. What we’ll end up with here is an m x k matrix of probabilities θ where each m document has a non-zero probability for each k topic. Note that in the plate diagram, these would come from a Dirichlet(α) draw rather than be fixed like this, but hopefully this will make things clear starting out.
library(tidyverse)
nDocs = 500 # Number of documents
wordsPerDoc = rpois(nDocs, 100) # Total words/terms in a document
thetaList = list(c(A = .60, B = .25, C = .15), # Topic proportions for first and second half of data
c(A = .10, B = .10, C = .80)) # These values represent a Dir(alpha) draw
theta_1 = t(replicate(nDocs / 2, thetaList[[1]]))
theta_2 = t(replicate(nDocs / 2, thetaList[[2]]))
theta = rbind(theta_1, theta_2)
With topic probabilities in hand, we’ll draw topic assignments from a categorical distribution.
firsthalf = 1:(nDocs / 2)
secondhalf = (nDocs / 2 + 1):nDocs
Z = apply(theta, 1, rmultinom, n = 1, size = 1) # draw topic assignment
rowMeans(Z[, firsthalf]) # roughly equal to theta_1
[1] 0.596 0.256 0.148
Now, we can see which is the most likely topic.
z = apply(Z, 2, which.max)
We have our list of documents and each document’s topic assignment.
Next we need the topics themselves. Topics are probability distributions of terms, and in what follows we’ll use the Dirichlet distribution to provide the prior probabilities for the terms. With topic A, we’ll make the first ~40% of terms have a higher probability of occurring, the last ~40% go with topic C, and the middle more associated with topic B. To give a sense of the alpha settings, alpha = c(8, 1, 1) would result in topic probabilities of .8, .1, .1, as would alpha = c(80, 10, 10). We’ll use the gtools package for the rdirichlet function.
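As an aside, the Dirichlet draw itself is easy to sketch in base R without gtools: draw independent Gamma variates and normalize them. This also shows why alpha = c(8, 1, 1) and alpha = c(80, 10, 10) share the same mean proportions of (.8, .1, .1) – scaling alpha only tightens the draws around that mean:

```r
set.seed(123)

# One Dirichlet(alpha) draw via normalized Gamma variates
rdirichletOne = function(alpha) {
  g = rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

rdirichletOne(c(8, 1, 1))    # scattered around (.8, .1, .1)
rdirichletOne(c(80, 10, 10)) # same mean, much less spread
```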
nTerms = max(wordsPerDoc)
breaks = quantile(1:nTerms, c(.4,.6,1)) %>% round()
cuts = list(1:breaks[1], (breaks[1] + 1):breaks[2],
(breaks[2] + 1):nTerms)
library(gtools)
B_k = matrix(0, ncol = 3, nrow = nTerms)
B_k[,1] = rdirichlet(n=1, alpha=c(rep(10, length(cuts[[1]])), # topics for 1st 40% of terms
rep(1, length(cuts[[2]])),
rep(1, length(cuts[[3]]))))
B_k[,2] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])), # topics for middle 20%
rep(10, length(cuts[[2]])),
rep(1, length(cuts[[3]]))))
B_k[,3] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])), # topics for last 40%
rep(1, length(cuts[[2]])),
rep(10, length(cuts[[3]]))))
Here is a visualization of the term-topic matrix, where the dark represents terms that are notably less likely to be associated with a particular topic.
library(ggplot2)
as.data.frame(B_k) %>%
mutate(document = 1:nrow(.)) %>%
tidyr::gather(key = topic, value = prob, -document) %>%
ggplot(., aes(topic, document, color = prob)) +
geom_tile()
Remember how we specified this – we assigned 40% to topic 1, 40% to topic 3, and left the middle 20% to topic 2. This visualization clearly shows this.
Now, given the topic assignments, we draw words for each document according to its size via a multinomial draw, and with that, we have our document-term matrix. We can also think of each document as merely a bag of words, where order, grammar, etc. are ignored, but the frequency of term usage is retained.
wordlist_1 = sapply(1:nDocs,
function(i) t(rmultinom(1, size = wordsPerDoc[i], prob = B_k[, z[i]])),
simplify = FALSE)
# smash to doc-term matrix
dtm_1 = do.call(rbind, wordlist_1)
colnames(dtm_1) = paste0('word', 1:nTerms)
# bag of words representation
wordlist_1 = lapply(wordlist_1, function(wds) rep(paste0('word', 1:length(wds)), wds))
If you print out the wordlist_1 object, you will see the words associated with each document.
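Before moving to real data, it is worth convincing ourselves that a document-term matrix simulated this way actually carries the topic structure. Here is a small self-contained sketch in the same spirit; the sizes and probabilities below are arbitrary choices for illustration, not the values used above.

```r
set.seed(123)

# three sharply separated topics over 30 terms: each topic puts nearly
# all of its mass on a different third of the vocabulary
nTerms = 30
nDocs = 100
B = matrix(0.001, nrow = nTerms, ncol = 3)
B[1:10, 1] = .1
B[11:20, 2] = .1
B[21:30, 3] = .1
B = apply(B, 2, function(p) p / sum(p)) # columns must sum to 1

z = sample(1:3, nDocs, replace = TRUE) # a topic for every document
dtm = t(sapply(z, function(k) rmultinom(1, size = 50, prob = B[, k])))

# because the topics barely overlap, the third of the vocabulary with
# the most counts recovers each document's topic assignment
blockCounts = cbind(rowSums(dtm[, 1:10]),
                    rowSums(dtm[, 11:20]),
                    rowSums(dtm[, 21:30]))
zRecovered = apply(blockCounts, 1, which.max)
mean(zRecovered == z) # should be 1 with topics this well separated
```

With real text the topics overlap far more, which is exactly why we need a model like LDA rather than simple counting, but the generative story is the same.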
Now, with some theory under our belt, we can take a look at analyzing some data. Just to keep everything light, we will be looking at the Last Statements of inmates executed in Texas.
Just like our sentiment analysis, there is a fair chunk of cleaning to do:
library(stm)
executions = read.csv("txEx.csv", stringsAsFactors = FALSE)
executionsText = textProcessor(documents = executions$correctedStatements,
                               metadata = executions)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Stemming...
Creating Output...
executionsPrep = prepDocuments(documents = executionsText$documents,
vocab = executionsText$vocab,
                              meta = executionsText$meta)
Removing 1240 of 2305 terms (1240 of 12857 tokens) due to frequency
Your corpus now has 425 documents, 1065 terms and 11617 tokens.
The stm package has some pretty nice facilities for determining the number of topics:
kTest = searchK(documents = executionsPrep$documents,
vocab = executionsPrep$vocab,
K = c(3, 4, 5, 10, 20), verbose = FALSE)
plot(kTest)
The 4 plots that are returned will help us determine the best number of topics. I like to focus on semantic coherence (how well the words hang together) and the residuals; we want low residuals and high semantic coherence. The residuals definitely take a sharp dive as we increase K. If we consider one of the major assumptions of LDA, we could almost always guess that we would need a great number of topics (i.e., if every topic that has ever existed, existed before writing, then we could have a huge number of topics). Our coherence does pretty well at some of the lower numbers (not entirely surprising). Taking all of these together, we can settle on 5 topics for our subsequent analyses.
It is worthwhile to note that we can also do model selection with the stm package, but that is some work that will be best done if you want some additional playtime.
With our 5 topics, we can start our actual model:
topics5 = stm(documents = executionsPrep$documents,
vocab = executionsPrep$vocab,
                K = 5, verbose = FALSE)
We can get a lot of output from our model, but we can focus on the expected topic proportions plot:
plot(topics5)
We are essentially looking at the proportion of each topic, with a few of the highest probability words.
This is great, but it is really fun to see what emerges from the topics.
labelTopics(topics5)
Topic 1 Top Words:
Highest Prob: love, thank, life, support, peopl, way, brother
FREX: live, allah, appreci, thank, bye, row, learn
Lift: cousin, maria, mauri, open, pam, within, afford
Score: thank, love, bye, allah, support, appreci, live
Topic 2 Top Words:
Highest Prob: know, love, want, yall, say, sorri, dont
FREX: alright, watch, dont, worri, mom, tell, care
Lift: alba, ass, daddi, fall, fault, folk, forc
Score: yall, dont, love, kid, sorri, tell, care
Topic 3 Top Words:
Highest Prob: will, kill, stay, peopl, innoc, murder, get
FREX: kill, innoc, murder, must, stop, justic, fight
Lift: adam, bed, church, dna, error, forgotten, heard
Score: kill, black, polic, innoc, must, evid, america
Topic 4 Top Words:
Highest Prob: famili, like, hope, peac, god, one, friend
FREX: receiv, famili, friend, peac, victim, like, mother
Lift: behalf, behold, final, grate, knowledg, known, lay
Score: peac, like, famili, god, rejoic, unto, propheci
Topic 5 Top Words:
Highest Prob: forgiv, lord, will, god, sorri, jesus, ask
FREX: forgiv, sin, thi, holi, prais, amen, accept
Lift: accept, action, anointest, bread, comfort, etern, everlast
Score: forgiv, thi, holi, thou, lord, jesus, amen
Let’s focus on the FREX words (they occur frequently within the topic and are exclusive to that topic) and the highest probability words (i.e., the words that have the highest probability of occurring within that topic). The Lift and Score words (just a few different ways of weighting occurrence) can be useful, but are a bit less intuitive than the other two. Let’s put some names to the topics. Topic 1 is likely expressing “thanks for the love and the support”, topic 2 is something along the lines of “everything is going to be alright”, topic 3 is “you are murdering an innocent man”, topic 4 is probably “I hope your family finds peace”, and topic 5 is “I hope God forgives me”.
We can look at statements that have a high probability of being associated with each topic:
findThoughts(topics5, texts = executionsPrep$meta$correctedStatements, n = 1)
Topic 1:
I would like to extend my love to my family members and my relatives for all of the love and support you have showed me. I extend my special love to my daughter who I love greatly. I hope that you forever remember me. I hope that you will always cherish the love and the strength that I have provided you. My love for you will remain with you winthin your heart and in part of your soul. As to all my brothers I love you all with all of my heart. But during your time of departure from this earth plane you will have to face the judgement of God for the lack of love you have shown my aunt and my cousins. We were never brought up to be that way. As you know our parents brought us up to love one another no matter what. There was no love showed to my aunt or none of my cousins. I can forgive you all but you must ask forgiveness from God for how you have hurt our aunt and our family. I leave now at this moment to join my parents and my only sister whose lives were not taken by me. To all the fellows on death row, I thank you for the love that you have shown me and for the strength that you provided me. You all keep your heads up. As for my attorney's I thank you all for being there for me. As defense attorneys you have shown me a lot strength. May my love touch each one of you all's souls as I leave this body.
Topic 2:
Hey, how y'all doing out there? I done lost my voice. Y'all be strong now, alright? Don, thanks man. I love you, Gloria, always baby. That's all I got to say. Hey, don't y'all worry about me, okay?
Topic 3:
I would like to say first of all the real violent crimes in this case are acts committed by James Boswell and Clay Morgan Gaines. We have the physical evidence to prove fabrication and cover-up. The people responsible for killing me will have blood on their hands for an unprovoked murder. I am not guilty; I acted in self-defense and reflex in the face of a police officer who was out of control. James Boswell had his head beat in; possibly due to this he had problems. My jurors had not heard about that. They did not know he had suffered a head injury from the beating by a crack dealer five months earlier; that he was filled with anger and wrote an angry letter to the Houston Chronicle. He expressed his frustration at the mayor, police chief and fire chief. He was mad at the world. Three and a half months before I worked on a deal with the DEA, the informant was let off. At the moment he left the courtroom, he became angry with me; Officer Boswell was upset about this. Officer Boswell and an angry woman were in the police car and they were talking in raised voices. In other words, Officer Boswell was angry at the time I walked up. Officer Boswell may have reacted to the...
Topic 4:
The following is the personal final statement of and by Michael L. McBride. The Beatitudes: Jesus lifted up his eyes on His disciples, and said, "Blessed be the poor: for yours is the kingdom of God. Blessed are ye that hunger now: for ye shall be filled. Blessed are ye that weep now: for ye shall laugh. Blessed are ye, when men shall hate you, and they shall separate you from their company, and shall reproach you, and cast out your name as evil for the Son of Man's sake. Rejoice ye in that day, and leap for joy: for behold, your reward is great in Heaven: for in the like manner did their fathers unto the prophets. But woe unto you that are rich! for ye have received your consolation. Woe unto you that are full! for ye shall hunger. Woe unto you that laugh now! for ye shall moan and weep. Woe unto you, when all men shall speak well of you! for so did their fathers to the false prophets. The supremacy of love over gifts: I Corinthians, Chapter 13: 4-8: Love is patient, love is kind, and is not jealous, love does not brag and is no arrogant, does not act unbecoming; it does not seek its own, is not provoked, does not take into account a wrong suffered, does not rejoice in unrighteousness, but rejoices with the truth; bears all things, believes all things, hopes all things, endures all things. Love never fails; but if there are gifts of prophecy, they will be done away; if there tongues, they will cease. Now abide faith, hope, love, these three: but the greatest of these is love. Poem: Do not stand at my grave and weep, I am not there I do not sleep. I am the diamond glints in the snow, I am the sunlight on the ripened grain. I am the gentle autumn rain. When you awaken in the morning's hush, I am the swift uplifting rush of quiet birds in circled flight, I am the soft stars that shine at night. Do not stand at my grave and cry, I am not there. I did not die. Signed Michael L. McBride #903 May 11, 2000 Huntsville, Texas
Topic 5:
Mama Isabel told me to tell you hello. Holy, holy, holy! Lord God Almighty! Early in the morning our song shall rise to Thee; Holy, holy, holy, merciful and mighty! God in three Persons, blessed Trinity. Holy, holy, holy! Merciful and mighty. All Thy works shall praise Thy name, in earth, and sky, and sea; Holy, holy, holy, merciful and mighty! God in three Persons, blessed Trinity. Oh, our Father who art in heaven, holy, holy, holy be Thy name. Thy kingdom come, Thy will be done, on earth as it is in heaven. Give us this day our daily bread and forgive us our sin as we forgive our debtors. Lead us not into temptation, but deliver us from evil, for Thine is the kingdom and the power and the glory forever and ever. Now, Father, into Thy hands I commit my spirit. Amen.
We can see that the story we put to our topics makes sense, given what we see in the actual texts.
Social Media
Social media is ripe for data (being mindful of Terms of Service, of course). Post history, both the content of posts and their frequency, provides excellent data points. Social networks are also a popular research endeavor.